I want to slice each row of my dataframe into 3 windows, with slice indices that are stored in another dataframe and change for each row. Afterwards I want to return a single dataframe containing the windows in the form of a MultiIndex. The rows in each window that are shorter than the longest row in that window should be filled with NaN values.
Since my actual dataframe has around 100,000 rows and 600 columns, I am concerned about an efficient solution.
Consider the following example:
This is my dataframe, which I want to slice into 3 windows:
>>> df
0 1 2 3 4 5 6 7
0 0 1 2 3 4 5 6 7
1 8 9 10 11 12 13 14 15
2 16 17 18 19 20 21 22 23
And this is the second dataframe containing my slicing indices, with the same number of rows as df:
>>> df_slice
0 1
0 3 5
1 2 6
2 4 7
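For reference, a minimal sketch to reproduce these two example frames (assuming default integer indices and columns):
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(24).reshape(3, 8))
df_slice = pd.DataFrame({0: [3, 2, 4], 1: [5, 6, 7]})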
I've tried slicing the windows, like so:
first_window = df.iloc[:, :df_slice.iloc[:, 0]]
first_window.columns = pd.MultiIndex.from_tuples([("A", c) for c in first_window.columns])
second_window = df.iloc[:, df_slice.iloc[:, 0] : df_slice.iloc[:, 1]]
second_window.columns = pd.MultiIndex.from_tuples([("B", c) for c in second_window.columns])
third_window = df.iloc[:, df_slice.iloc[:, 1]:]
third_window.columns = pd.MultiIndex.from_tuples([("C", c) for c in third_window.columns])
result = pd.concat([first_window,
second_window,
third_window], axis=1)
Which gives me the following error:
TypeError: cannot do slice indexing on <class 'pandas.core.indexes.range.RangeIndex'> with these indexers [0 3
1 2
2 4
Name: 0, dtype: int64] of <class 'pandas.core.series.Series'>
My expected output is something like this:
>>> result
A B C
0 1 2 3 4 5 6 7 8 9 10
0 0 1 2 NaN 3 4 NaN NaN 5 6 7
1 8 9 NaN NaN 10 11 12 13 14 15 NaN
2 16 17 18 19 20 21 22 NaN 23 NaN NaN
Is there an efficient solution for my problem without iterating over each row of my dataframe?
Here's a solution using melt and then pivot_table, plus some logic to:
Identify the three groups 'A', 'B', and 'C'.
Shift the columns to the left, so that NaN values only appear at the right side of each window.
Rename the columns to get the expected output.
import pandas as pd

t = df.reset_index().melt(id_vars="index")
# df_slice has the integer column names 0 and 1; rename them so they can be
# referenced as t.c_0 and t.c_1 below
t = pd.merge(t, df_slice.rename(columns={0: "c_0", 1: "c_1"}),
             left_on="index", right_index=True)
t.variable = pd.to_numeric(t.variable)
t.loc[t.variable < t.c_0,"group"] = "A"
t.loc[(t.variable >= t.c_0) & (t.variable < t.c_1), "group"] = "B"
t.loc[t.variable >= t.c_1, "group"] = "C"
# shift relevant values to the left
shift_val = t.groupby(["group", "index"]).variable.transform("min") - t.groupby(["group"]).variable.transform("min")
t.variable = t.variable - shift_val
# extract a, b, and c groups, and create a multi-level index for their
# columns
df_a = pd.pivot_table(t[t.group == "A"], index= "index", columns="variable", values="value")
df_a.columns = pd.MultiIndex.from_product([["a"], df_a.columns])
df_b = pd.pivot_table(t[t.group == "B"], index= "index", columns="variable", values="value")
df_b.columns = pd.MultiIndex.from_product([["b"], df_b.columns])
df_c = pd.pivot_table(t[t.group == "C"], index= "index", columns="variable", values="value")
df_c.columns = pd.MultiIndex.from_product([["c"], df_c.columns])
res = pd.concat([df_a, df_b, df_c], axis=1)
res.columns = pd.MultiIndex.from_tuples([(c[0], i) for i, c in enumerate(res.columns)])
print(res)
The output is:
a b c
0 1 2 3 4 5 6 7 8 9 10
index
0 0.0 1.0 2.0 NaN 3.0 4.0 NaN NaN 5.0 6.0 7.0
1 8.0 9.0 NaN NaN 10.0 11.0 12.0 13.0 14.0 15.0 NaN
2 16.0 17.0 18.0 19.0 20.0 21.0 22.0 NaN 23.0 NaN NaN
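As a side note, each (index, shifted variable) pair is unique within a group, so plain pivot would work here as well and avoids pivot_table's implicit aggregation. A sketch for group 'A':
df_a = t[t.group == "A"].pivot(index="index", columns="variable", values="value")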
DataFrame:
A B C
0 1 6 11
1 2 7 12
2 3 8 13
3 4 9 14
4 5 10 15
Is it possible to drop the values from index 2 to 4 in column B, or to replace them with NaN?
In this case, the values [8, 9, 10] should be removed.
I tried this: df.drop(columns=['B'], index=[8, 9, 10]), but then the whole column B is removed.
Dropping individual values does not make sense in a DataFrame. You can set the values to NaN instead, and use .loc / .iloc to access rows/columns:
>>> df
A B C
a 1 6 11
b 2 7 12
c 3 8 13
d 4 9 14
e 5 10 15
import numpy as np

# By name:
df.loc['c':'e', 'B'] = np.nan
# By number (column B is at position 1):
df.iloc[2:5, 1] = np.nan
Read carefully the Indexing and selecting data section of the pandas documentation.
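Applied to the frame from the question (default integer index 0-4), the same idea would be, as a sketch:
df.loc[2:4, 'B'] = np.nan  # .loc slicing is label-based and inclusive, so rows 2, 3 and 4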
import pandas as pd
data = [
    ['A', 'B', 'C'],
    [1, 6, 11],
    [2, 7, 12],
    [3, 8, 13],
    [4, 9, 14],
    [5, 10, 15]
]
df = pd.DataFrame(data=data[1:], columns=data[0])
# shift column B down by 3 rows; the vacated rows at the top become NaN
df['B'] = df['B'].shift(3)
>>> df
A B C
0 1 NaN 11
1 2 NaN 12
2 3 NaN 13
3 4 6.0 14
4 5 7.0 15
I am dealing with pandas DataFrames like this:
id x
0 1 10
1 1 20
2 2 100
3 2 200
4 1 NaN
5 2 NaN
6 1 300
7 1 NaN
I would like to replace each NaN 'x' with the previous non-NaN 'x' from a row with the same 'id' value:
id x
0 1 10
1 1 20
2 2 100
3 2 200
4 1 20
5 2 200
6 1 300
7 1 300
Is there some slick way to do this without manually looping over rows?
You could perform a groupby/forward-fill operation on each group:
import numpy as np
import pandas as pd
df = pd.DataFrame({'id': [1,1,2,2,1,2,1,1], 'x':[10,20,100,200,np.nan,np.nan,300,np.nan]})
df['x'] = df.groupby(['id'])['x'].ffill()
print(df)
yields
id x
0 1 10.0
1 1 20.0
2 2 100.0
3 2 200.0
4 1 20.0
5 2 200.0
6 1 300.0
7 1 300.0
df
id val
0 1 23.0
1 1 NaN
2 1 NaN
3 2 NaN
4 2 34.0
5 2 NaN
6 3 2.0
7 3 NaN
8 3 NaN
df.sort_values(['id','val']).groupby('id').ffill()
id val
0 1 23.0
1 1 23.0
2 1 23.0
4 2 34.0
3 2 34.0
5 2 34.0
6 3 2.0
7 3 2.0
8 3 2.0
Use sort_values, groupby and ffill so that, if a group's first value (or first several values) is NaN, it also gets filled.
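If the original row order matters, the filled result can simply be sorted back by its index afterwards, for example:
res = df.sort_values(['id', 'val']).groupby('id').ffill().sort_index()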
Solution for multi-key problem:
In this example, the data has the key [date, region, type]. Date is the index on the original dataframe.
import os
import pandas as pd
#sort to make indexing faster
df.sort_values(by=['date','region','type'], inplace=True)
#collect all possible regions and types
regions = list(set(df['region']))
types = list(set(df['type']))
#record column names
df_cols = df.columns
#delete ffill_df.csv so we can begin anew
try:
    os.remove('ffill_df.csv')
except FileNotFoundError:
    pass
# steps:
# 1) grab rows with a particular region and type
# 2) use forwardfill to fill nulls
# 3) use backwardfill to fill remaining nulls
# 4) append to file
for r in regions:
    for t in types:
        group_df = df[(df.region == r) & (df.type == t)].copy()
        group_df.fillna(method='ffill', inplace=True)
        group_df.fillna(method='bfill', inplace=True)
        group_df.to_csv('ffill_df.csv', mode='a', header=False, index=True)
Checking the result:
#load in the ffill_df
ffill_df = pd.read_csv('ffill_df.csv', header=None, index_col=None)
ffill_df.columns = df_cols  # column names recorded earlier
ffill_df.index= ffill_df.date
ffill_df.drop('date', axis=1, inplace=True)
ffill_df.head()
#compare new and old dataframe
print(df.shape)
print(ffill_df.shape)
print()
print(pd.isnull(ffill_df).sum())
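As a side note, the same per-group forward/backward fill can be done without the explicit nested loop and the intermediate CSV, using groupby and transform. A sketch, assuming the frame is already sorted by date within each (region, type) group:
group_cols = ['region', 'type']
value_cols = [c for c in df.columns if c not in group_cols]
df[value_cols] = (df.groupby(group_cols)[value_cols]
                    .transform(lambda s: s.ffill().bfill()))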
I am appending three CSVs:
df = pd.read_csv("places_1.csv")
temp = pd.read_csv("places_2.csv")
df = df.append(temp)
temp = pd.read_csv("places_3.csv")
df = df.append(temp)
print(df.head(20))
the joined table looks like:
location device_count population
0 A 11 NaN
1 B 12 NaN
2 C 13 NaN
3 D 14 NaN
4 E 15 NaN
0 F 21 NaN
1 G 22 NaN
2 H 23 NaN
3 I 24 NaN
4 J 25 NaN
0 K 31 NaN
1 L 32 NaN
2 M 33 NaN
3 N 34 NaN
4 O 35 NaN
As you can see the indices are not unique.
When I call this iloc function to multiply the population column by 2:
df2 = df.copy()
for index, row in df.iterrows():
    df.iloc[index, df.columns.get_loc('population')] = row['device_count'] * 2
I get the following erroneous result:
location device_count population
0 A 11 62.0
1 B 12 64.0
2 C 13 66.0
3 D 14 68.0
4 E 15 70.0
0 F 21 NaN
1 G 22 NaN
2 H 23 NaN
3 I 24 NaN
4 J 25 NaN
0 K 31 NaN
1 L 32 NaN
2 M 33 NaN
3 N 34 NaN
4 O 35 NaN
For each CSV, it is overwriting the rows at the indices of the first CSV.
I have also tried creating a new column of integers and calling df.set_index(). That did not work.
Any tips?
First, use ignore_index; second, don't use append, use pd.concat([temp1, temp2, temp3], ignore_index=True).
As others have stated, you can use ignore_index, and you probably should use pd.concat here. Alternatively, for other situations where you are not combining DataFrames, you can also use df = df.reset_index(drop=True) to change the indices after the fact.
Additionally, you should avoid using iterrows() for reasons listed in the docs here. Using the following works way better:
df.loc[:, 'population'] = df.loc[:, 'device_count'].astype('int') * 2
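Putting both suggestions together, a minimal sketch (using the file names from the question):
import pandas as pd

# read and combine the three CSVs with a fresh, unique RangeIndex
frames = [pd.read_csv(f"places_{i}.csv") for i in (1, 2, 3)]
df = pd.concat(frames, ignore_index=True)

# vectorised update instead of iterrows()
df['population'] = df['device_count'] * 2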
I want to add a list as a column to the df dataframe. The list has a different length than the dataframe.
df =
A B C
1 2 3
5 6 9
4
6 6
8 4
2 3
4
6 6
8 4
D = [11,17,18]
I want the following output
df =
A B C D
1 2 3 11
5 6 9 17
4 18
6 6
8 4
2 3
4
6 6
8 4
I am doing the following to extend the list to the length of the dataframe by appending "nan":
# number of NaN values required for the list to match the column length
extend_length = df.shape[0]-len(D)
# extend the list
D.extend(extend_length * ['nan'])
# add to the dataframe
df["D"] = D
A B C D
1 2 3 11
5 6 9 17
4 18
6 6 nan
8 4 nan
2 3 nan
4 nan
6 6 nan
8 4 nan
Where "nan" is treated like string but I want it to be empty ot "nan", thus, if I search for number of valid cell in D column it will provide output of 3.
Adding the list as a Series will handle this directly.
D = [11,17,18]
df.loc[:, 'D'] = pd.Series(D)
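With real NaN values in the column, counting the valid cells then gives the expected result:
>>> df['D'].count()
3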
A simple pd.concat of df and a Series built from D works as follows:
pd.concat([df, pd.Series(D, name='D')], axis=1)
or
df.assign(D=pd.Series(D))
Out[654]:
A B C D
0 1 2.0 3.0 11.0
1 5 6.0 9.0 17.0
2 4 NaN NaN 18.0
3 6 NaN 6.0 NaN
4 8 NaN 4.0 NaN
5 2 NaN 3.0 NaN
6 4 NaN NaN NaN
7 6 NaN 6.0 NaN
8 8 NaN 4.0 NaN
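Note that pd.Series(D) aligns on the default RangeIndex, so all of the above assumes df has a plain 0..n-1 index. If it does not, an explicit index can be passed, e.g. as a sketch assuming the first len(D) rows should receive the values:
df['D'] = pd.Series(D, index=df.index[:len(D)])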
I am trying to join a fragment of a dataframe with another one. The structure of the dataframe to join is simplified below:
left:
ID f1 TIME
1 10 1
3 10 1
7 10 1
9 10 2
2 10 2
1 10 2
3 10 2
right:
ID f2 f3
1 0 11
7 9 11
I need to select the left dataset by time and attach the right one to it; the result I would like to have is the following:
left:
ID f1 TIME f2 f3
1 10 1 0 11
3 10 1 nan nan
7 10 1 9 11
9 10 2 nan nan
2 10 2 nan nan
1 10 2 nan nan
3 10 2 nan nan
Currently I am usually joining dataframes in this way:
left = left.join(right.set_index('ID'), on='ID')
In this case I am using:
left[left.TIME == 1] = left[left.TIME == 1].join(right.set_index('ID'), on='ID')
I have also tried with merge, but the result is the left dataframe without any of the other columns.
Finally, the structure of my script needs to do this for every unique TIME in the dataframe, thus:
for t in numpy.unique(left.TIME):
    # do the join on the fragment left.TIME == t
If I save the returned value from the join function in a new dataframe, everything works fine, but trying to assign the values back into the left dataframe does not work.
EDIT: The IDs of the left dataset can be present multiple times, but not inside the same TIME value.
You can filter first by boolean indexing, merge and concat last:
df1 = left[left['TIME']==1]
#alternative
#df1 = left.query('TIME == 1')
df2 = left[left['TIME']!=1]
#alternative
#df2 = left.query('TIME != 1')
df = pd.concat([df1.merge(right, how='left'), df2])
print (df)
ID TIME f1 f2 f3
0 1 1 10 0.0 11.0
1 3 1 10 NaN NaN
2 7 1 10 9.0 11.0
3 9 2 10 NaN NaN
4 2 2 10 NaN NaN
5 1 2 10 NaN NaN
6 3 2 10 NaN NaN
EDIT: merge creates a default index, so a possible solution is to move the original index into a column first (reset_index), merge, and then restore it with set_index:
print (left)
ID f1 TIME
10 1 10 1
11 3 10 1
12 7 10 1
13 9 10 2
14 2 10 2
15 1 10 2
16 3 10 2
#df = left.merge(right, how='left')
df1 = left[left['TIME']==1]
df2 = left[left['TIME']!=1]
df = pd.concat([df1.reset_index().merge(right, how='left').set_index('index'), df2])
print (df)
ID TIME f1 f2 f3
10 1 1 10 0.0 11.0
11 3 1 10 NaN NaN
12 7 1 10 9.0 11.0
13 9 2 10 NaN NaN
14 2 2 10 NaN NaN
15 1 2 10 NaN NaN
16 3 2 10 NaN NaN
EDIT:
After discussion, and after modifying the input data so that right also contains a TIME column, it is possible to use:
df = left.merge(right, how='left', on=['ID','TIME'])
This is one way:
res = left.drop_duplicates('ID')\
          .merge(right, how='left')\
          .append(left[left.duplicated(subset=['ID'])])
# ID TIME f1 f2 f3
# 0 1 1 10 0.0 11.0
# 1 3 1 10 NaN NaN
# 2 7 1 10 9.0 11.0
# 3 9 2 10 NaN NaN
# 4 2 2 10 NaN NaN
# 5 1 2 10 NaN NaN
# 6 3 2 10 NaN NaN
Note that columns f2 and f3 become float since NaN is considered a float.
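As a side note, DataFrame.append was removed in pandas 2.0; the same result can be produced with pd.concat, for example:
res = pd.concat([
    left.drop_duplicates('ID').merge(right, how='left'),
    left[left.duplicated(subset=['ID'])],
])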