Populating pandas dataframe efficiently using a 2-D numpy array

Populating pandas dataframe efficiently using a 2-D numpy array - python

I have a 2-D numpy array each row of which consists of three elements - ['dataframe_column_name', 'dataframe_index', 'value'].
Now, I tried populating the pandas dataframe using iloc double for loop but it is quite slow. Is there any faster way of doing this. I am a bit new to pandas, so apologies in case this is something very basic.
Here is the code snippet :
my_nparray = [['a', 1, 123], ['b', 1, 230], ['a', 2, 321]]
for r in range(my_nparray.shape[0]):
[col, ind, value] = my_nparray[r]
df.iloc[col][ind] = value
This takes a lot of time when my_nparray is large, is there any other way of doing this?
Initially assume that I can create this data frame :
'a' 'b'
1 NaN NaN
2 NaN NaN
I want the output as :
'a' 'b'
1 123 230
2 321 NaN

You can use from_records and then pivot:
df = pd.DataFrame.from_records(my_nparray, index=1).pivot(columns=0)
2
0 a b
1
1 123.0 230.0
2 321.0 NaN
This specifies that the index uses field 1 from your array and pivot uses Series 0 for the columns.
Then we can reset the MultiIndex on the columns and the index:
df.columns = df.columns.droplevel(None)
df.columns.name = None
df.index.name = None
a b
1 123.0 230.0
2 321.0 NaN

Use DataFrame constructor with DataFrame.pivot and DataFrame.rename_axis:
df = pd.DataFrame(my_nparray).pivot(1,0,2).rename_axis(index=None, columns=None)
print (df)
a b
1 123.0 230.0
2 321.0 NaN

Related

Pandas groupby diff removes column

I have a dataframe like this:
d = {'id': ['101_i','101_e','102_i','102_e'], 1: [3, 4, 5, 7], 2: [5,9,10,11], 3: [8,4,3,7]}
df = pd.DataFrame(data=d)
I want to subtract all rows which have the same prefix id, i.e. subtract all values of rows 101_i with 101_e or vice versa. The code I use for that is:
df['new_identifier'] = [x.upper().replace('E', '').replace('I','').replace('_','') for x in df['id']]
df = df.groupby('new_identifier')[df.columns[1:-1]].diff().dropna()
I get the output like this:
I see that I lose the new column that I create, new_identifier. Is there a way I can retain that?

You can define specific aggregation function (in this case np.diff() for columns 1, 2, and 3) for columns that you know the types (int or float in this case).
import numpy as np
df.groupby('new_identifier').agg({i: np.diff for i in range(1, 4)}).dropna()
Result:
1 2 3
new_identifier
101 1 4 -4
102 2 1 4

Series.str.split to get groups, you need DataFrame.set_axis() before GroupBy, after that we use GroupBy.diff
cols = df.columns.difference(['id'])
groups = df['id'].str.split('_').str[0]
new_df = (
df.set_axis(groups, axis=0)
.groupby(level=0)
[cols]
.diff()
.dropna()
)
print(new_df)
1 2 3
id
101 1.0 4.0 -4.0
102 2.0 1.0 4.0
Detail Groups
df['id'].str.split('_').str[0]
0 101
1 101
2 102
3 102
Name: id, dtype: object

Reshaping DataFrame with pandas

So I'm working with pandas on python. I collect data indexed by timestamps with multiple ways.
This means I can have one index with 2 features available (and the others with NaN values, it's normal) or all features, it depends.
So my problem is when I add some data with multiple values for the same indices, see the example below :
Imagine this is the set we're adding new data :
Index col1 col2
1 a A
2 b B
3 c C
This the data we will add:
Index new col
1 z
1 y
Then the result is this :
Index col1 col2 new col
1 a A NaN
1 NaN NaN z
1 NaN NaN y
2 b B NaN
3 c C NaN
So instead of that, I would like the result to be :
Index col1 col2 new col1 new col2
1 a A z y
2 b B NaN NaN
3 c C NaN NaN
I want that instead of having multiples indexes in 1 feature, there will be 1 index for multiple features.
I don't know if this is understandable. Another way is to say that I want this : number of values per timestamp=number of features instead of =numbers of indices.

This solution assumes the data that you need to add is a series.
Original df:
df = pd.DataFrame(np.random.randint(0,3,size=(3,3)),columns = list('ABC'),index = [1,2,3])
Data to add (series):
s = pd.Series(['x','y'],index = [1,1])
Solution:
df.join(s.to_frame()
.assign(cc = lambda x: x.groupby(level=0)
.cumcount().add(1))
.set_index('cc',append=True)[0]
.unstack()
.rename('New Col{}'.format,axis=1))
Output:
A B C New Col1 New Col2
1 1 2 2 x y
2 0 1 2 NaN NaN
3 2 2 0 NaN NaN

Alternative answer (maybe more simplistic, probably less pythonic). I think you need to look at converting wide data to long data and back again in general (pivot and transpose might be good things to look up for this), but I also think there are some possible problems in your question. You don't mention new col 1 and new col 2 in the declaration of the subsequent arrays.
Here's my declarations of your data frames:
d = {'index': [1, 2, 3],'col1': ['a', 'b', 'c'], 'col2': ['A', 'B', 'C']}
df = pd.DataFrame(data=d)
e1 = {'index': [1], 'new col1': ['z']}
dfe1 = pd.DataFrame(data=e1)
e2 = {'index': [1], 'new col2': ['y']}
dfe2 = pd.DataFrame(data=e2)
They look like this:
index new col1
1 z
and this:
index new col2
1 y
Notice that I declare your new columns as part of the data frames. Once they're declared like that, it's just a matter of merging:
dfr = pd.merge(df, dfe, on='index', how="outer")
dfr1 = pd.merge(df, dfe1, on='index', how="outer")
dfr2 = pd.merge(dfr1, dfe2, on='index', how="outer")
And the output looks like this:
index col1 col2 new col1 new col2
1 a A z y
2 b B NaN NaN
3 c C NaN NaN

I think one problem may arise in the way you first create your second data frame.
Actually, expanding the number of feature depending on its content is what makes this reformatting a bit annoying here (as you could see for yourself, when writing two new column names out of the bare assumption that this reflect the number of feature observed at every timestamps).
Here is yet another solution, this tries to be a bit more explicit in the step taken than rhug123's answer.
# Initial dataFrames
a = pd.DataFrame({'col1':['a', 'b', 'c'], 'col2':['A', 'B', 'C']}, index=range(1, 4))
b = pd.DataFrame({'new col':['z', 'y']}, index=[1, 1])
Now the only important step is basically transposing your second DataFrame, while here you also need to intorduce two new column names.
We will do this grouping of the second dataframe according to its content (y, z, ...):
c = b.groupby(b.index)['new col'].apply(list) # this has also one index per timestamp, but all features are grouped in a list
# New column names:
cols = ['New col%d'%(k+1) for in range(b.value_counts().sum())]
# Expanding dataframe "c" for each new column
d = pd.DataFrame(c.to_list(), index=b.index.unique(), columns=cols)
# Merge
a.join(d, how='outer')
Output:
col1 col2 New col1 New col2
1 a A z y
2 b B NaN NaN
3 c C NaN NaN
Finally, one problem encountered with both my answer and the one from rhug123, is that as for now it won't deal with another feature at a different timestamp correctly. Not sure what the OP expects here.
For example if b is:
new col
1 z
1 y
2 x
The merged output will be:
col1 col2 New col1 New col2
1 a A z y
2 b B x None
3 c C NaN NaN

insert missing rows in df with dictionary values

Hello I have the following dataframe
df = pd.DataFrame(data={'grade_1':['A','B','C'],
'grade_1_count': [19,28,32],
'grade_2': ['pass','fail',np.nan],
'grade_2_count': [39,18, np.nan]})
whereby some grades as missing, and need to be inserted in to the grade_n column according to the values in this dictionary
grade_dict = {'grade_1':['A','B','C','D','E','F'],
'grade_2' : ['pass','fail','not present', 'borderline']}
and the corresponding row value in the _count column should be filled with np.nan
so the expected output is like this
expected_df = pd.DataFrame(data={'grade_1':['A','B','C','D','E','F'],
'grade_1_count': [19,28,32,0,0,0],
'grade_2': ['pass','fail','not preset','borderline', np.nan, np.nan],
'grade_2_count': [39,18,0,0,np.nan,np.nan]})
so far I have this rather inelegant code that creates a column that includes all the correct categories for the grades, but i cannot reinsert it in to the dataframe, or fill the count columns with zeros (where the np.nans just reflect empty cells due to coercing columns with different lengths of rows) I hope that makes sense. any advice would be great. thanks
x=[]
for k, v in grade_dict.items():
out = df[k].reindex(grade_dict[k], axis=0, fill_value=0)
x = pd.concat([out], axis=1)
x[k] = x.index
x = x.reset_index(drop=True)
df[k] = x.fillna(np.nan)

Here is a solution using two consecutive merges:
# set up combinations
from itertools import zip_longest
df2 = pd.DataFrame(list(zip_longest(*grade_dict.values())), columns=grade_dict)
# merge
(df2.merge(df.filter(like='grade_1'),
on='grade_1', how='left')
.merge(df.filter(like='grade_2'),
on='grade_2', how='left')
.sort_index(axis=1)
)
output:
grade_1 grade_1_count grade_2 grade_2_count
0 A 19.0 pass 39.0
1 B 28.0 fail 18.0
2 C 32.0 not present NaN
3 D NaN borderline NaN
4 E NaN None NaN
5 F NaN None NaN
multiple merges:
df2 = pd.DataFrame(list(zip_longest(*grade_dict.values())), columns=grade_dict)
for col in grade_dict:
df2 = df2.merge(df.filter(like=col),
on=col, how='left')
df2

If you only need to merge on grade_1 without updating the non-NaNs of grade_2, you can cast grade_dict into a df and then use combine_first:
print (df.set_index("grade_1").combine_first(pd.DataFrame(grade_dict.values(),
index=grade_dict.keys()).T.set_index("grade_1"))
.fillna({"grade_1_count": 0}).reset_index())
grade_1 grade_1_count grade_2 grade_2_count
0 A 19.0 pass 39.0
1 B 28.0 fail 18.0
2 C 32.0 not present NaN
3 D 0.0 borderline NaN
4 E 0.0 None NaN
5 F 0.0 None NaN

Concatenate all columns in dataframe except for NaN

Another simple one. I have a DataFrame (1056 x 39) that contains reference variables from a pivot table. I now need to generate a column of concatenated values of all columns, which exclude NaNs. The trouble is that I have quite a few NaNs which are interfering with the output.
Based on another post that I have found Concatenating all columns in pandas dataframe, I can use this approach.
df['Merge'] = df.astype(str).agg(' or '.join,axis=1)
The trouble is that NaNs remain. How can I modify this line to exclude NaN values (skip them essentially) such that the output will only contain concatenated values.
The intended output should appear as (first row):
df['Merge'][0] = 'Var1 or Var2 or Var 20 or Var28' (all NaN values were excluded)
Thanks :)

You can stack to remove the NaN then cast to string and groupby + str.join
import pandas as pd
df = pd.DataFrame([[1.0, np.NaN, 2, 3, 'foo'], [np.NaN, None, 5, 'bar', 'bazz']])
df['merged'] = df.stack().astype(str).groupby(level=0).agg(' or '.join)
# 0 1 2 3 4 merged
#0 1.0 NaN 2 3 foo 1.0 or 2 or 3 or foo
#1 NaN NaN 5 bar bazz 5 or bar or bazz
Or you can apply along the rows, dropping nulls, casting to string then joining all the non-nulls.
df = pd.DataFrame([[1.0, np.NaN, 2, 3, 'foo'], [np.NaN, None, 5, 'bar', 'bazz']])
df['merged'] = df.apply(lambda row: ' or '.join(row.dropna().astype(str)), axis=1)
# 0 1 2 3 4 merged
#0 1.0 NaN 2 3 foo 1.0 or 2 or 3 or foo
#1 NaN NaN 5 bar bazz 5 or bar or bazz

Best way to add multiple list to existing dataframe [duplicate]

I'm trying to figure out how to add multiple columns to pandas simultaneously with Pandas. I would like to do this in one step rather than multiple repeated steps.
import pandas as pd
df = {'col_1': [0, 1, 2, 3],
'col_2': [4, 5, 6, 7]}
df = pd.DataFrame(df)
df[[ 'column_new_1', 'column_new_2','column_new_3']] = [np.nan, 'dogs',3] # I thought this would work here...

I would have expected your syntax to work too. The problem arises because when you create new columns with the column-list syntax (df[[new1, new2]] = ...), pandas requires that the right hand side be a DataFrame (note that it doesn't actually matter if the columns of the DataFrame have the same names as the columns you are creating).
Your syntax works fine for assigning scalar values to existing columns, and pandas is also happy to assign scalar values to a new column using the single-column syntax (df[new1] = ...). So the solution is either to convert this into several single-column assignments, or create a suitable DataFrame for the right-hand side.
Here are several approaches that will work:
import pandas as pd
import numpy as np
df = pd.DataFrame({
'col_1': [0, 1, 2, 3],
'col_2': [4, 5, 6, 7]
})
Then one of the following:
1) Three assignments in one, using list unpacking:
df['column_new_1'], df['column_new_2'], df['column_new_3'] = [np.nan, 'dogs', 3]
2) DataFrame conveniently expands a single row to match the index, so you can do this:
df[['column_new_1', 'column_new_2', 'column_new_3']] = pd.DataFrame([[np.nan, 'dogs', 3]], index=df.index)
3) Make a temporary data frame with new columns, then combine with the original data frame later:
df = pd.concat(
[
df,
pd.DataFrame(
[[np.nan, 'dogs', 3]],
index=df.index,
columns=['column_new_1', 'column_new_2', 'column_new_3']
)
], axis=1
)
4) Similar to the previous, but using join instead of concat (may be less efficient):
df = df.join(pd.DataFrame(
[[np.nan, 'dogs', 3]],
index=df.index,
columns=['column_new_1', 'column_new_2', 'column_new_3']
))
5) Using a dict is a more "natural" way to create the new data frame than the previous two, but the new columns will be sorted alphabetically (at least before Python 3.6 or 3.7):
df = df.join(pd.DataFrame(
{
'column_new_1': np.nan,
'column_new_2': 'dogs',
'column_new_3': 3
}, index=df.index
))
6) Use .assign() with multiple column arguments.
I like this variant on #zero's answer a lot, but like the previous one, the new columns will always be sorted alphabetically, at least with early versions of Python:
df = df.assign(column_new_1=np.nan, column_new_2='dogs', column_new_3=3)
7) This is interesting (based on https://stackoverflow.com/a/44951376/3830997), but I don't know when it would be worth the trouble:
new_cols = ['column_new_1', 'column_new_2', 'column_new_3']
new_vals = [np.nan, 'dogs', 3]
df = df.reindex(columns=df.columns.tolist() + new_cols) # add empty cols
df[new_cols] = new_vals # multi-column assignment works for existing cols
8) In the end it's hard to beat three separate assignments:
df['column_new_1'] = np.nan
df['column_new_2'] = 'dogs'
df['column_new_3'] = 3
Note: many of these options have already been covered in other answers: Add multiple columns to DataFrame and set them equal to an existing column, Is it possible to add several columns at once to a pandas DataFrame?, Add multiple empty columns to pandas DataFrame

You could use assign with a dict of column names and values.
In [1069]: df.assign(**{'col_new_1': np.nan, 'col2_new_2': 'dogs', 'col3_new_3': 3})
Out[1069]:
col_1 col_2 col2_new_2 col3_new_3 col_new_1
0 0 4 dogs 3 NaN
1 1 5 dogs 3 NaN
2 2 6 dogs 3 NaN
3 3 7 dogs 3 NaN

My goal when writing Pandas is to write efficient readable code that I can chain. I won't go into why I like chaining so much here, I expound on that in my book, Effective Pandas.
I often want to add new columns in a succinct manner that also allows me to chain. My general rule is that I update or create columns using the .assign method.
To answer your question, I would use the following code:
(df
.assign(column_new_1=np.nan,
column_new_2='dogs',
column_new_3=3
)
)
To go a little further. I often have a dataframe that has new columns that I want to add to my dataframe. Let's assume it looks like say... a dataframe with the three columns you want:
df2 = pd.DataFrame({'column_new_1': np.nan,
'column_new_2': 'dogs',
'column_new_3': 3},
index=df.index
)
In this case I would write the following code:
(df
.assign(**df2)
)

With the use of concat:
In [128]: df
Out[128]:
col_1 col_2
0 0 4
1 1 5
2 2 6
3 3 7
In [129]: pd.concat([df, pd.DataFrame(columns = [ 'column_new_1', 'column_new_2','column_new_3'])])
Out[129]:
col_1 col_2 column_new_1 column_new_2 column_new_3
0 0.0 4.0 NaN NaN NaN
1 1.0 5.0 NaN NaN NaN
2 2.0 6.0 NaN NaN NaN
3 3.0 7.0 NaN NaN NaN
Not very sure of what you wanted to do with [np.nan, 'dogs',3]. Maybe now set them as default values?
In [142]: df1 = pd.concat([df, pd.DataFrame(columns = [ 'column_new_1', 'column_new_2','column_new_3'])])
In [143]: df1[[ 'column_new_1', 'column_new_2','column_new_3']] = [np.nan, 'dogs', 3]
In [144]: df1
Out[144]:
col_1 col_2 column_new_1 column_new_2 column_new_3
0 0.0 4.0 NaN dogs 3
1 1.0 5.0 NaN dogs 3
2 2.0 6.0 NaN dogs 3
3 3.0 7.0 NaN dogs 3

Dictionary mapping with .assign():
This is the most readable and dynamic way to assign new column(s) with value(s) when working with many of them.
import pandas as pd
import numpy as np
new_cols = ["column_new_1", "column_new_2", "column_new_3"]
new_vals = [np.nan, "dogs", 3]
# Map new columns as keys and new values as values
col_val_mapping = dict(zip(new_cols, new_vals))
# Unpack new column/new value pairs and assign them to the data frame
df = df.assign(**col_val_mapping)
If you're just trying to initialize the new column values to be empty as you either don't know what the values are going to be or you have many new columns.
import pandas as pd
import numpy as np
new_cols = ["column_new_1", "column_new_2", "column_new_3"]
new_vals = [None for item in new_cols]
# Map new columns as keys and new values as values
col_val_mapping = dict(zip(new_cols, new_vals))
# Unpack new column/new value pairs and assign them to the data frame
df = df.assign(**col_val_mapping)

use of list comprehension, pd.DataFrame and pd.concat
pd.concat(
[
df,
pd.DataFrame(
[[np.nan, 'dogs', 3] for _ in range(df.shape[0])],
df.index, ['column_new_1', 'column_new_2','column_new_3']
)
], axis=1)

if adding a lot of missing columns (a, b, c ,....) with the same value, here 0, i did this:
new_cols = ["a", "b", "c" ]
df[new_cols] = pd.DataFrame([[0] * len(new_cols)], index=df.index)
It's based on the second variant of the accepted answer.

Just want to point out that option2 in #Matthias Fripp's answer
(2) I wouldn't necessarily expect DataFrame to work this way, but it does
df[['column_new_1', 'column_new_2', 'column_new_3']] = pd.DataFrame([[np.nan, 'dogs', 3]], index=df.index)
is already documented in pandas' own documentation
http://pandas.pydata.org/pandas-docs/stable/indexing.html#basics
You can pass a list of columns to [] to select columns in that order.
If a column is not contained in the DataFrame, an exception will be raised.
Multiple columns can also be set in this manner.
You may find this useful for applying a transform (in-place) to a subset of the columns.

You can use tuple unpacking:
df = pd.DataFrame({'col1': [1, 2], 'col2': [3, 4]})
df['col3'], df['col4'] = 'a', 10
Result:
col1 col2 col3 col4
0 1 3 a 10
1 2 4 a 10

If you just want to add empty new columns, reindex will do the job
df
col_1 col_2
0 0 4
1 1 5
2 2 6
3 3 7
df.reindex(list(df)+['column_new_1', 'column_new_2','column_new_3'], axis=1)
col_1 col_2 column_new_1 column_new_2 column_new_3
0 0 4 NaN NaN NaN
1 1 5 NaN NaN NaN
2 2 6 NaN NaN NaN
3 3 7 NaN NaN NaN
full code example
import numpy as np
import pandas as pd
df = {'col_1': [0, 1, 2, 3],
'col_2': [4, 5, 6, 7]}
df = pd.DataFrame(df)
print('df',df, sep='\n')
print()
df=df.reindex(list(df)+['column_new_1', 'column_new_2','column_new_3'], axis=1)
print('''df.reindex(list(df)+['column_new_1', 'column_new_2','column_new_3'], axis=1)''',df, sep='\n')
otherwise go for zeros answer with assign

I am not comfortable using "Index" and so on...could come up as below
df.columns
Index(['A123', 'B123'], dtype='object')
df=pd.concat([df,pd.DataFrame(columns=list('CDE'))])
df.rename(columns={
'C':'C123',
'D':'D123',
'E':'E123'
},inplace=True)
df.columns
Index(['A123', 'B123', 'C123', 'D123', 'E123'], dtype='object')

You could instantiate the values from a dictionary if you wanted different values for each column & you don't mind making a dictionary on the line before.
>>> import pandas as pd
>>> import numpy as np
>>> df = pd.DataFrame({
'col_1': [0, 1, 2, 3],
'col_2': [4, 5, 6, 7]
})
>>> df
col_1 col_2
0 0 4
1 1 5
2 2 6
3 3 7
>>> cols = {
'column_new_1':np.nan,
'column_new_2':'dogs',
'column_new_3': 3
}
>>> df[list(cols)] = pd.DataFrame(data={k:[v]*len(df) for k,v in cols.items()})
>>> df
col_1 col_2 column_new_1 column_new_2 column_new_3
0 0 4 NaN dogs 3
1 1 5 NaN dogs 3
2 2 6 NaN dogs 3
3 3 7 NaN dogs 3
Not necessarily better than the accepted answer, but it's another approach not yet listed.

import pandas as pd
df = pd.DataFrame({
'col_1': [0, 1, 2, 3],
'col_2': [4, 5, 6, 7]
})
df['col_3'], df['col_4'] = [df.col_1]*2
>> df
col_1 col_2 col_3 col_4
0 4 0 0
1 5 1 1
2 6 2 2
3 7 3 3

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Populating pandas dataframe efficiently using a 2-D numpy array - python

Use DataFrame constructor with DataFrame.pivot and DataFrame.rename_axis: df = pd.DataFrame(my_nparray).pivot(1,0,2).rename_axis(index=None, columns=None) print (df) a b 1 123.0 230.0 2 321.0 NaN

Related

Pandas groupby diff removes column

Reshaping DataFrame with pandas

insert missing rows in df with dictionary values

Concatenate all columns in dataframe except for NaN

Best way to add multiple list to existing dataframe [duplicate]

Categories

Resources