Fixing Indexing when Appending Dataframes - python

I am appending three CSVs:
df = pd.read_csv("places_1.csv")
temp = pd.read_csv("places_2.csv")
df = df.append(temp)
temp = pd.read_csv("places_3.csv")
df = df.append(temp)
print(df.head(20))
The joined table looks like:
  location  device_count  population
0        A            11         NaN
1        B            12         NaN
2        C            13         NaN
3        D            14         NaN
4        E            15         NaN
0        F            21         NaN
1        G            22         NaN
2        H            23         NaN
3        I            24         NaN
4        J            25         NaN
0        K            31         NaN
1        L            32         NaN
2        M            33         NaN
3        N            34         NaN
4        O            35         NaN
As you can see, the indices are not unique.
When I then run this loop, which uses iloc to set the population column to twice the device count:
df2 = df.copy()
for index, row in df.iterrows():
    df.iloc[index, df.columns.get_loc('population')] = row['device_count'] * 2
I get the following erroneous result:
  location  device_count  population
0        A            11        62.0
1        B            12        64.0
2        C            13        66.0
3        D            14        68.0
4        E            15        70.0
0        F            21         NaN
1        G            22         NaN
2        H            23         NaN
3        I            24         NaN
4        J            25         NaN
0        K            31         NaN
1        L            32         NaN
2        M            33         NaN
3        N            34         NaN
4        O            35         NaN
For each CSV, it only ever writes to the rows at the index positions of the first CSV.
I have also tried creating a new column of integers and calling df.set_index(). That did not work.
Any tips?

First, use ignore_index. Second, don't use append (it was deprecated in pandas 1.4 and removed in 2.0); use pd.concat([temp1, temp2, temp3], ignore_index=True).
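A minimal sketch of the whole read-and-combine step, assuming the same three file names as in the question:
import pandas as pd

# read each CSV, then combine them under a fresh 0..n-1 index
frames = [pd.read_csv('places_{}.csv'.format(i)) for i in (1, 2, 3)]
df = pd.concat(frames, ignore_index=True)
With ignore_index=True, concat discards each file's own 0-4 index and renumbers the result 0-14, so later positional or label-based writes hit distinct rows.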

As others have stated, you can use ignore_index, and you probably should use pd.concat here. Alternatively, for situations where you are not combining DataFrames, you can use df = df.reset_index(drop=True) to rebuild a unique index after the fact.
Incidentally, this also explains your output: iterrows() yields the duplicated index labels 0-4, and iloc treats those labels as positions, so every chunk's writes landed on the first five rows; the final values 62-70 are just the last chunk's device counts doubled.
Additionally, you should avoid iterrows() for the reasons listed in the pandas docs. The following vectorized assignment works much better:
df.loc[:, 'population'] = df.loc[:, 'device_count'].astype('int') * 2

Related

Slicing each dataframe row into 3 windows with different slicing ranges

I want to slice each row of my dataframe into 3 windows with slice indices that are stored in another dataframe and change for each row of the dataframe. Afterwards i want to return a single dataframe containing the windows in form of a MultiIndex. The rows in each windows that are shorter than the longest row in the window should be filled with NaN values.
Since my actual dataframe has around 100.000 rows and 600 columns, i am concerned about an efficient solution.
Consider the following example:
This is my dataframe which i want to slice into 3 windows
>>> df
    0   1   2   3   4   5   6   7
0   0   1   2   3   4   5   6   7
1   8   9  10  11  12  13  14  15
2  16  17  18  19  20  21  22  23
And the second dataframe containing my slicing indices having the same count of rows as df:
>>> df_slice
   0  1
0  3  5
1  2  6
2  4  7
I've tried slicing the windows, like so:
first_window = df.iloc[:, :df_slice.iloc[:, 0]]
first_window.columns = pd.MultiIndex.from_tuples([("A", c) for c in first_window.columns])
second_window = df.iloc[:, df_slice.iloc[:, 0] : df_slice.iloc[:, 1]]
second_window.columns = pd.MultiIndex.from_tuples([("B", c) for c in second_window.columns])
third_window = df.iloc[:, df_slice.iloc[:, 1]:]
third_window.columns = pd.MultiIndex.from_tuples([("C", c) for c in third_window.columns])
result = pd.concat([first_window,
                    second_window,
                    third_window], axis=1)
Which gives me the following error:
TypeError: cannot do slice indexing on <class 'pandas.core.indexes.range.RangeIndex'> with these indexers [0 3
1 2
2 4
Name: 0, dtype: int64] of <class 'pandas.core.series.Series'>
My expected output is something like this:
>>> result
    A                  B                   C
    0   1   2    3     4   5    6    7    8    9   10
0   0   1   2  NaN     3   4  NaN  NaN    5    6    7
1   8   9 NaN  NaN    10  11   12   13   14   15  NaN
2  16  17  18   19    20  21   22  NaN   23  NaN  NaN
Is there an efficient solution for my problem without iterating over each row of my dataframe?
Here's a solution using melt and then pivot_table, plus some logic to:
Identify the three groups 'A', 'B', and 'C'.
Shift the columns to the left, so that NaN would only appear at the right side of each window.
Rename columns to get the expected output.
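For reference, here is a sketch constructing the question's example frames, so the code below can be run as-is:
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(24).reshape(3, 8))         # 3 rows, columns 0-7
df_slice = pd.DataFrame({0: [3, 2, 4], 1: [5, 6, 7]})  # per-row slice points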
# give the slice columns explicit names so they can be referenced below
df_slice = df_slice.rename(columns={0: "c_0", 1: "c_1"})
t = df.reset_index().melt(id_vars="index")
t = pd.merge(t, df_slice, left_on="index", right_index=True)
t.variable = pd.to_numeric(t.variable)
t.loc[t.variable < t.c_0,"group"] = "A"
t.loc[(t.variable >= t.c_0) & (t.variable < t.c_1), "group"] = "B"
t.loc[t.variable >= t.c_1, "group"] = "C"
# shift relevant values to the left
shift_val = t.groupby(["group", "index"]).variable.transform("min") - t.groupby(["group"]).variable.transform("min")
t.variable = t.variable - shift_val
# extract a, b, and c groups, and create a multi-level index for their
# columns
df_a = pd.pivot_table(t[t.group == "A"], index= "index", columns="variable", values="value")
df_a.columns = pd.MultiIndex.from_product([["a"], df_a.columns])
df_b = pd.pivot_table(t[t.group == "B"], index= "index", columns="variable", values="value")
df_b.columns = pd.MultiIndex.from_product([["b"], df_b.columns])
df_c = pd.pivot_table(t[t.group == "C"], index= "index", columns="variable", values="value")
df_c.columns = pd.MultiIndex.from_product([["c"], df_c.columns])
res = pd.concat([df_a, df_b, df_c], axis=1)
res.columns = pd.MultiIndex.from_tuples([(c[0], i) for i, c in enumerate(res.columns)])
print(res)
The output is:
          a                       b                         c
        0    1     2     3      4     5     6     7       8     9    10
index
0     0.0  1.0   2.0   NaN    3.0   4.0   NaN   NaN     5.0   6.0   7.0
1     8.0  9.0   NaN   NaN   10.0  11.0  12.0  13.0    14.0  15.0   NaN
2    16.0 17.0  18.0  19.0   20.0  21.0  22.0   NaN    23.0   NaN   NaN

Map values from one dataframe to new columns in other based on column values - Pandas

I have a problem with mapping values from another dataframe.
These are samples of two dataframes:
df1
product  class_1  class_2  class_3
   141A       11       13        5
   53F4       12       11       18
   GS24       14       12       10
df2
id  product_type_0  product_type_1  product_type_2  product_type_3  measure_0  measure_1  measure_2  measure_3
 1            141A            GS24             NaN             NaN          1          3        NaN        NaN
 2            53F4             NaN             NaN             NaN          1        NaN        NaN        NaN
 3            53F4            141A            141A             NaN          2          2          1        NaN
 4            141A            GS24             NaN             NaN          3          2        NaN        NaN
What I'm trying to get is the following:
I need to add new columns called max_class_0 through max_class_3, whose values are taken from df1.
For each order number (_0 through _3), look at the corresponding product_type column (for example product_type_1) and find the row in df1 with the same product. Then look at the corresponding measure column (for example measure_1); if its value is n (at most four different values occur in the original data), the new max_class column takes the value of class_n for that product. For product_type_1 = '141A' with measure_1 = 1, max_class_1 would be 11.
It's probably a bit simpler than I've explained it.
Desired output
id  product_type_0  product_type_1  product_type_2  product_type_3  measure_0  measure_1  measure_2  measure_3  max_class_0  max_class_1  max_class_2  max_class_3
 1            141A            GS24             NaN             NaN          1          3        NaN        NaN           11           10          NaN          NaN
 2            53F4             NaN             NaN             NaN          1        NaN        NaN        NaN           12          NaN          NaN          NaN
 3            53F4            141A            141A             NaN          2          2          1        NaN           11           13           11          NaN
 4            141A            GS24             NaN             NaN          3          2        NaN        NaN            5           12          NaN          NaN
The code I have tried:
df2['max_class_1'] = None
df2['max_class_2'] = None
df2['max_class_3'] = None

def get_max_class(product_df, measure_df, product_type_column, measure_column, max_class_columns):
    for index, row in measure_df.iterrows():
        product_df_new = product_df[product_df['product'] == row[product_type_column]]
        for ind, r in product_df_new.iterrows():
            if row[measure_column] == 1:
                row[max_class_columns] = r['class_1']
            elif row[measure_column] == 2:
                row[max_class_columns] = r['class_2']
            elif row[measure_column] == 3:
                row[max_class_columns] = r['class_3']
            else:
                row[max_class_columns] = "There is no measure or type"
    return measure_df

# And the function calls
first_class = get_max_class(product_df=df1, measure_df=df2, product_type_column='product_type_1', measure_column='measure_1', max_class_columns='max_class_1')
second_class = get_max_class(product_df=df1, measure_df=first_class, product_type_column='product_type_2', measure_column='measure_2', max_class_columns='max_class_2')
third_class = get_max_class(product_df=df1, measure_df=second_class, product_type_column='product_type_3', measure_column='measure_3', max_class_columns='max_class_3')
I'm pretty sure there is a simpler solution, but I don't know why this is not working. I'm getting all None values; nothing changes.
pd.DataFrame.lookup is the classic method for lookups by row and column labels (but see the version note at the end).
Your problem is complicated by the existence of null values. But this can be accommodated by modifying your input mapping dataframe.
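For reference, a sketch constructing the sample frames from the question:
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'product': ['141A', '53F4', 'GS24'],
                    'class_1': [11, 12, 14],
                    'class_2': [13, 11, 12],
                    'class_3': [5, 18, 10]})
df2 = pd.DataFrame({'id': [1, 2, 3, 4],
                    'product_type_0': ['141A', '53F4', '53F4', '141A'],
                    'product_type_1': ['GS24', None, '141A', 'GS24'],
                    'product_type_2': [None, None, '141A', None],
                    'product_type_3': [None] * 4,
                    'measure_0': [1, 1, 2, 3],
                    'measure_1': [3, None, 2, 2],
                    'measure_2': [None, None, 1, None],
                    'measure_3': [None] * 4})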
Step 1
Rename columns in df1 to integers and add an extra row / column. We will use the added data later to deal with null values.
def rename_cols(x):
    return x if not x.startswith('class') else int(x.split('_')[-1])

df1 = df1.rename(columns=rename_cols)
df1 = df1.set_index('product')
df1.loc['X'] = 0
df1[0] = 0
Your mapping dataframe now looks like:
print(df1)
          1   2   3  0
product
141A     11  13   5  0
53F4     12  11  18  0
GS24     14  12  10  0
X         0   0   0  0
Step 2
Iterate over the number of categories and use pd.DataFrame.lookup. Notice how we fillna with 'X' and 0, exactly the filler values we added to the mapping data in Step 1.
n = df2.columns.str.startswith('measure').sum()
for i in range(n):
    rows = df2['product_type_{}'.format(i)].fillna('X')
    cols = df2['measure_{}'.format(i)].fillna(0).astype(int)
    df2['max_{}'.format(i)] = df1.lookup(rows, cols)
Result
print(df2)
   id product_type_0 product_type_1 product_type_2 product_type_3  measure_0  \
0   1           141A           GS24            NaN            NaN          1
1   2           53F4            NaN            NaN            NaN          1
2   3           53F4           141A           141A            NaN          2
3   4           141A           GS24            NaN            NaN          3
   measure_1  measure_2  measure_3  max_0  max_1  max_2  max_3
0        3.0        NaN        NaN     11     10      0      0
1        NaN        NaN        NaN     12      0      0      0
2        2.0        1.0        NaN     11     13     11      0
3        2.0        NaN        NaN      5     12      0      0
You can convert the 0 to np.nan if required. This will be at the expense of converting your series from int to float, since NaN is a float.
Of course, if X and 0 are valid values, you can use alternative filler values from the start.
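One caveat for current readers: DataFrame.lookup was deprecated in pandas 1.2 and removed in 2.0. On modern versions, an equivalent sketch using positional indexing (same rows and cols as in Step 2):
for i in range(n):
    rows = df2['product_type_{}'.format(i)].fillna('X')
    cols = df2['measure_{}'.format(i)].fillna(0).astype(int)
    # translate labels to positions, then index the underlying array
    r_idx = df1.index.get_indexer(rows)
    c_idx = df1.columns.get_indexer(cols)
    df2['max_{}'.format(i)] = df1.to_numpy()[r_idx, c_idx]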

Applying values to column and grouping all columns by those values

I have a pandas dataframe as shown here. All rows without a value in ["sente"] contain further information, but they are not yet linked to ["sente"].
id   pos  value  sente
 1     a      I     21
 2     b   have     21
 3     b      a     21
 4     a    cat     21
 5     d      !     21
 6   cat      N    NaN
 7     a     My     22
 8     a    cat     22
 9     b     is     22
10     a   cute     22
11     d      .     22
12   cat      N    NaN
13  cute      M    NaN
Now I want each row without a value in ["sente"] to take over the value from the row above. Then I want to group everything by ["sente"] and collect the content of the rows that had no ["sente"] value in a new column:
sente  pos        value             content
   21  a,b,b,a,d  I have a cat !    'cat,N'
   22  a,a,b,a,d  My cat is cute .  'cat,N','cute,M'
This would be my first step:
df.loc[(df['sente'] != df["sente"].shift(-1)) & df["sente"].isna(), "sente"] = df["sente"].shift(1)
but it only works for one additional row, not if there are 2 or more.
This groups up one column like I want it:
df.groupby(["sente"])['value'].apply(lambda x: " ".join(x))
But for more columns it doesn't work like I want:
df.groupby(["sente"]).agg(lambda x: ",".join(x))
Is there any way to do this without using stack functions?
Use:
import numpy as np

# boolean mask of rows where sente is NaN
m = df['sente'].isnull()
# new column joining pos and value, only where the mask is True
df['content'] = np.where(m, df['pos'] + ',' + df['value'], np.nan)
# blank out pos and value where the mask is True
df[['pos', 'value']] = df[['pos', 'value']].mask(m)
print (df)
    id  pos value  sente content
0    1    a     I   21.0     NaN
1    2    b  have   21.0     NaN
2    3    b     a   21.0     NaN
3    4    a   cat   21.0     NaN
4    5    d     !   21.0     NaN
5    6  NaN   NaN    NaN   cat,N
6    7    a    My   22.0     NaN
7    8    a   cat   22.0     NaN
8    9    b    is   22.0     NaN
9   10    a  cute   22.0     NaN
10  11    d     .   22.0     NaN
11  12  NaN   NaN    NaN   cat,N
12  13  NaN   NaN    NaN  cute,M
Finally, replace the remaining NaNs in sente by forward filling with ffill, group on it, and join each column's values after removing NaNs with dropna:
df1 = df.groupby(df["sente"].ffill()).agg(lambda x: " ".join(x.dropna()))
print (df1)
             pos             value       content
sente
21.0   a b b a d    I have a cat !         cat,N
22.0   a a b a d  My cat is cute .  cat,N cute,M
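A caveat for newer pandas versions: older releases silently dropped columns on which the join lambda raised (here the numeric id and sente columns), which is presumably why they are absent from the output above. Recent versions raise an error instead, so a sketch that drops them explicitly first:
df1 = (df.drop(columns=['id', 'sente'])
         .groupby(df['sente'].ffill())
         .agg(lambda x: ' '.join(x.dropna())))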

Add column in dataframe from list

I have a dataframe with some columns like this:
A B C
0
4
5
6
7
7
6
5
The possible range of values in A is only from 0 to 7.
Also, I have a list of 8 elements like this:
List = [2, 5, 6, 8, 12, 16, 26, 32]  # there are only 8 elements in this list
If the element in column A is n, I need to insert the n-th element from the List in a new column, say 'D'.
How can I do this in one go without looping over the whole dataframe?
The resulting dataframe would look like this:
A  B  C   D
0         2
4        12
5        16
6        26
7        32
7        32
6        26
5        16
Note: The dataframe is huge and iteration is the last resort. But I can also arrange the elements in 'List' in any other data structure, like a dict, if necessary.
Just assign the list directly (its length must match the number of rows):
df['new_col'] = mylist
Alternative
Convert the list to a series or array and then assign:
se = pd.Series(mylist)
df['new_col'] = se.values
or
df['new_col'] = np.array(mylist)
IIUC, if you make your (unfortunately named) List into an ndarray, you can simply index into it naturally.
>>> import numpy as np
>>> m = np.arange(16)*10
>>> m[df.A]
array([ 0, 40, 50, 60, 150, 150, 140, 130])
>>> df["D"] = m[df.A]
>>> df
A B C D
0 0 NaN NaN 0
1 4 NaN NaN 40
2 5 NaN NaN 50
3 6 NaN NaN 60
4 15 NaN NaN 150
5 15 NaN NaN 150
6 14 NaN NaN 140
7 13 NaN NaN 130
Here I built a new m, but if you use m = np.asarray(List), the same thing should work: the values in df.A will pick out the appropriate elements of m.
Note that if you're using an old version of numpy, you might have to use m[df.A.values] instead-- in the past, numpy didn't play well with others, and some refactoring in pandas caused some headaches. Things have improved now.
A solution improving on the great one from @sparrow.
Let df be your dataframe, and mylist the list with the values you want to add to it.
Let's suppose you want to call your new column simply new_column.
First make the list into a Series:
column_values = pd.Series(mylist)
Then use the insert function to add the column. This function lets you choose the position at which to place the column.
In the following example we will put the new column in the first position from the left (by setting loc=0):
df.insert(loc=0, column='new_column', value=column_values)
First, let's create the dataframe you had; I'll ignore columns B and C as they are not relevant.
df = pd.DataFrame({'A': [0, 4, 5, 6, 7, 7, 6,5]})
And the mapping that you desire:
mapping = dict(enumerate([2,5,6,8,12,16,26,32]))
df['D'] = df['A'].map(mapping)
Done!
print(df)
Output:
A D
0 0 2
1 4 12
2 5 16
3 6 26
4 7 32
5 7 32
6 6 26
7 5 16
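As a side note, map also accepts a Series, keyed by its index, so the dict step can be skipped (a sketch):
df['D'] = df['A'].map(pd.Series([2, 5, 6, 8, 12, 16, 26, 32]))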
Old question, but I always try to use the fastest code! I had a huge list of 69 million uint64 values; np.array() was the fastest for me.
df['hashes'] = hashes
Time spent: 17.034842014312744
df['hashes'] = pd.Series(hashes).values
Time spent: 17.141014337539673
df['key'] = np.array(hashes)
Time spent: 10.724546194076538
You can also use df.assign:
In [1559]: df
Out[1559]:
A B C
0 0 NaN NaN
1 4 NaN NaN
2 5 NaN NaN
3 6 NaN NaN
4 7 NaN NaN
5 7 NaN NaN
6 6 NaN NaN
7 5 NaN NaN
In [1560]: mylist = [2,5,6,8,12,16,26,32]
In [1567]: df = df.assign(D=mylist)
In [1568]: df
Out[1568]:
A B C D
0 0 NaN NaN 2
1 4 NaN NaN 5
2 5 NaN NaN 6
3 6 NaN NaN 8
4 7 NaN NaN 12
5 7 NaN NaN 16
6 6 NaN NaN 26
7 5 NaN NaN 32

Iterrows a rolling sum

I have a pandas dataframe
from pandas import DataFrame, Series
where each row corresponds to one case, and each column corresponds to one month. I want to perform a rolling sum over each 12 month period. Seems simple enough, but I'm getting stuck with
result = [x for x.rolling_sum(12) in df.iterrows()]
result = [x for x.rolling_sum(12) in df.T.iteritems()]
SyntaxError: can't assign to function call
a = []
for x in df.iterrows():
    s = x.rolling_sum(12)
    a.append(s)
AttributeError: 'tuple' object has no attribute 'rolling_sum'
I think perhaps what you are looking for is
pd.rolling_sum(df, 12, axis=1)
In which case, no list comprehension is necessary. The axis=1 parameter causes Pandas to compute a rolling sum over rows of df.
For example,
import numpy as np
import pandas as pd
ncols, nrows = 13, 2
df = pd.DataFrame(np.arange(ncols*nrows).reshape(nrows, ncols))
print(df)
# 0 1 2 3 4 5 6 7 8 9 10 11 12
# 0 0 1 2 3 4 5 6 7 8 9 10 11 12
# 1 13 14 15 16 17 18 19 20 21 22 23 24 25
print(pd.rolling_sum(df, 12, axis=1))
prints
    0   1   2   3   4   5   6   7   8   9  10   11   12
0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN   66   78
1 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN  222  234
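Note for current readers: pd.rolling_sum was deprecated in pandas 0.18 and later removed. On modern versions the same row-wise rolling sum can be written with the .rolling accessor, transposing so the window runs along the original columns (a sketch):
print(df.T.rolling(12).sum().T)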
Regarding your list comprehension:
You've got the parts of the list comprehension in the wrong order. Try:
result = [expression for x in df.iterrows()]
See the docs for more about list comprehensions.
The basic form of a list comprehension is
[expression for variable in sequence]
And the resultant list is equivalent to result after Python executes:
result = []
for variable in sequence:
    result.append(expression)
