Pandas DataFrame to multidimensional NumPy Array - python

I have a Dataframe which I want to transform into a multidimensional array using one of the columns as the 3rd dimension.
As an example:
df = pd.DataFrame({
'id': [1, 2, 2, 3, 3, 3],
'date': np.random.randint(1, 6, 6),
'value1': [11, 12, 13, 14, 15, 16],
'value2': [21, 22, 23, 24, 25, 26]
})
I would like to transform it into a 3D array with dimensions (id, date, values) like this:
The problem is that the 'id's do not have the same number of occurrences so I cannot use np.reshape().
For this simplified example, I was able to use:
ra = np.full((3, 3, 3), np.nan)
for i, value in enumerate(df['id'].unique()):
rows = df.loc[df['id'] == value].shape[0]
ra[i, :rows, :] = df.loc[df['id'] == value, 'date':'value2']
To produce the needed result:
but the original DataFrame contains millions of rows.
Is there a vectorized way to accomplice the same result?

Approach #1
Here's one vectorized approach after sorting id col with df.sort_values('id', inplace=True) as suggested by #Yannis in comments -
count_id = df.id.value_counts().sort_index().values
mask = count_id[:,None] > np.arange(count_id.max())
vals = df.loc[:, 'date':'value2'].values
out_shp = mask.shape + (vals.shape[1],)
out = np.full(out_shp, np.nan)
out[mask] = vals
Approach #2
Another with factorize that doesn't require any pre-sorting -
x = df.id.factorize()[0]
y = df.groupby(x).cumcount().values
vals = df.loc[:, 'date':'value2'].values
out_shp = (x.max()+1, y.max()+1, vals.shape[1])
out = np.full(out_shp, np.nan)
out[x,y] = vals

Related

How to edit all data value given in a dataframe except for the values of a particular index?

I have a dataframe consisting of float64 values in it. I have to divide each value by hundred except for the the values of the row of index no. 388. For that I wrote the following code.
Dataset
Preprocessing:
df = pd.read_csv('state_cpi.csv')
d = {'January':1, 'February':2, 'March':3, 'April':4, 'May':5, 'June':6, 'July':7, 'August':8, 'September':9, 'October':10, 'November':11, 'December':12}
df['Month']=df['Name'].map(d)
r = {'Rural':1, 'Urban':2, 'Rural+Urban':3}
df['Region_code']=df['Sector'].map(r)
df['Himachal Pradesh'] = df['Himachal Pradesh'].str.replace('--','NaN')
df['Himachal Pradesh'] = df['Himachal Pradesh'].astype('float64')
Extracting the data of use:
data = df.iloc[:,3:-2]
Applying the division on the data dataframe
data[:,:388] = (data[:,:388] / 100).round(2)
data[:,389:] = (data[:,389:] / 100).round(2)
It returned me a dataframe where the data of row no. 388 was also divided by 100.
Dataset
As an example, I give the created dataframe. Indices except for 10 are copied into the aaa list. These index numbers are then supplied when querying and 1 is added to each element. The row with index 10 remains unchanged.
df = pd.DataFrame({'a': [1, 23, 4, 5, 7, 7, 8, 10, 9],
'b': [1, 2, 3, 4, 5, 6, 7, 8, 9]},
index=[1, 2, 5, 7, 8, 9, 10, 11, 12])
aaa = df[df.index != 10].index
df.loc[aaa, :] = df.loc[aaa, :] + 1
In your case, the code will be as follows:
aaa = data[data.index != 388].index
data.loc[aaa, :] = (data.loc[aaa, :] / 100).round(2)

how to delete rows and columns in numpy python?

I am having trouble creating a function which takes a matrix M as an input and deletes BOTH rows and columns containing the number 0 and giving an output containing the remaining numbers. Any help is much appreciated as I have my programming exam coming up soon.
By "deleting both rows and columns" this is what I mean:
import numpy as np
x = np.array([[1,2,3,4,5],
[6,0,8,9,10],
[11,12,13,14,15],
[16,0,0,19,20]])
idxs_array = list(np.where(x==0))
idxs_array = [list(dict.fromkeys(x)) for x in idxs_array]
for axis, idxs in enumerate(idxs_array):
sub_factor = 0
for idx in idxs:
x = np.delete(x,idx-sub_factor,axis)
sub_factor += 1
print(x)
# x = [[ 1, 4, 5],
# [11, 14, 15]]
1. Locate zero elements
First of all, we need to identify the location of the zero elements in the matrix, which can be done easily with np.where().
np.where will return the row/column indices of the elements matched specific condition (doc).
row_idx, col_idx = np.where(arr == 0)
2. Remove corresponding rows/columns
To remove corresponding rows and columns, there is an easy way to do this, which is indexing (doc).
That is, you can specify the row (or column) you want to keep with True, else it shall be False.
print(np.arange(4)[[True, False, True, False]])
# array([0, 2])
3. Put two things together
Here is a minimal example.
arr = np.array([[ 1, 2, 3, 4, 5],
[ 6, 0, 8, 9, 10],
[11, 12, 13, 14, 15],
[16, 0, 0, 19, 20]])
row_idx, col_idx = np.where(arr == 0)
rm_row_idx = set(row_idx.tolist())
rm_col_idx = set(col_idx.tolist())
row_mask = [i not in rm_row_idx for i in range(arr.shape[0])]
col_mask = [i not in rm_col_idx for i in range(arr.shape[1])]
arr = arr[row_mask, :]
arr = arr[:, col_mask]
print(arr)
# Shall be:
# array([[ 1, 4, 5],
# [11, 14, 15]])

Pick first element of groupby object's group by index without converting to list

In the code below, I am iterating groups of groupby object and printing the first item in column
b of each group.
import pandas as pd
d = {
'a': [1, 2, 3, 4, 5, 6],
'b': [10, 20, 30, 10, 20, 30],
}
df = pd.DataFrame(d)
groups = df.groupby('b')
for name, group in groups:
first_item_in_b = group['b'].tolist()[0]
print(first_item_in_b)
Because the groupby has hierarchical index, in order to pick the first element in b I need to
convert b to list first.
How can I avoid such overhead?
I cannot just remove tolist() like so:
first_item_in_b = group['b'][0]
because it will give KeyError.
You can use Index.get_loc for get position of column b, so possible use iat or iloc only or by first value of index with column name by Series.at.
Or is possible select by position by Series.iat or Series.iloc after selecting by column label b:
for name, group in groups:
#first value by positions from columns names
first_item_in_b = group.iat[0, group.columns.get_loc('b')]
#first value by labels from index
first_item_in_b = group.at[group.index[0],'b']
#fast select first value
first_item_in_b = group['b'].iat[0]
#alternative
first_item_in_b = group['b'].iloc[0]
print(first_item_in_b)
10
20
30
Using iloc:
import pandas as pd
d = {
'a': [1, 2, 3, 4, 5, 6],
'b': [10, 20, 30, 10, 20, 30],
}
df = pd.DataFrame(d)
groups = df.groupby('b')
for name, group in groups:
first_item_in_b = group['b'].iloc[0]
print(first_item_in_b)
OUTPUT:
10
20
30
EDIT:
Or Using the Fast integer location scalar accessor.

Pandas Indexing and Column Creation

I have a dataset, df.
I extracted another dataset from df, df_rec, based on a certain condition.
I can access the indexes of df_rec by df_rec.index.
Now, I want to create a column in df, where the index in df if matches with indexes in df_rec should be populated as 1 otherwise 0.
Any help, will be appreciated.
I am thinking, like, which throws error.
df['reccurences'] = 0
df['reccurences'][df.index in df_rec.index] = 1
You can use map on the index of df to chek whether it is in df_res and set the value accordingly as shown below.
df = pd.DataFrame()
df['X'] = [1, 2, 3, 4, 5, 6]
df['Y'] = [10, 20, 30, 40, 50, 60]
df_res = df.loc[df['X'] > 3]
df['C'] = df.index.map(lambda x : 1 if x in df_res.index else 0)
OR you can do like this
df['C'] = [1 if x in df_res.index else 0 for x in df.index]

How do I get the index of each item in a groupby object in Pandas?

I use groupby on a dataframe based on the columns I want and then I have to take the index of each item in its group. By index I mean, if there are 10 items in a group, the index goes from 0 to 9, not the dataframe index.
My code for doing this is below:
import pandas as pd
df = pd.DataFrame({'A': np.random.randint(0, 11, 10 ** 3), 'B': np.random.randint(0, 11, 10 ** 3),
'C': np.random.randint(0, 11, 10 ** 3), 'D': np.random.randint(0, 2, 10 ** 3)})
grouped_by = df.groupby(["A", "B", "C"])
groups = dict(list(grouped_by))
index_dict = {k: v.index.tolist() for k,v in groups.items()}
df["POS"] = df.apply(lambda x: index_dict[(x["A"], x["B"], x["C"])].index(x.name), axis=1)
The dataframe here is just an example.
Is there a way to use the grouped_by to achieve this ?
Here's a solution using cumcount() on a dummy variable to generate a item index for each group. It should be significantly faster too.
In [122]: df['dummy'] = 0
...: df["POS"] = df.groupby(['A','B','C'])['dummy'].cumcount()
...: df = df.drop('dummy', axis=1)
As #unutbu noted, even cleaner just to use:
df["POS"] = df.groupby(['A','B','C']).cumcount()

Categories

Resources