Pandas Indexing and Column Creation - python

I have a dataset, df.
I extracted another dataset from df, df_rec, based on a certain condition.
I can access the indexes of df_rec by df_rec.index.
Now, I want to create a column in df that is populated with 1 wherever the index of df matches an index in df_rec, and 0 otherwise.
Any help will be appreciated.
I am thinking of something like the following, which throws an error:
df['reccurences'] = 0
df['reccurences'][df.index in df_rec.index] = 1

You can use map on the index of df to check whether it is in df_res and set the value accordingly, as shown below.
df = pd.DataFrame()
df['X'] = [1, 2, 3, 4, 5, 6]
df['Y'] = [10, 20, 30, 40, 50, 60]
df_res = df.loc[df['X'] > 3]
df['C'] = df.index.map(lambda x : 1 if x in df_res.index else 0)
Or you can use a list comprehension:
df['C'] = [1 if x in df_res.index else 0 for x in df.index]
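A vectorized alternative, sketched below for the same df and df_res, is Index.isin, which avoids the per-element Python lookup:
# Boolean mask of which labels of df.index appear in df_res.index,
# cast to int so matches become 1 and non-matches become 0.
df['C'] = df.index.isin(df_res.index).astype(int)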

Related

How to compare each element of a delimited string in a pandas data frame column with the elements of a Python list

I have a data frame with a delimited string column that has to be compared with a list. If the elements of the delimited string and the elements of the list intersect, that row should be kept.
For example
test_lst = [20, 45, 35]
data = pd.DataFrame({'colA': [1, 2, 3],
                     'colB': ['20,45,50,60', '22,70,35', '10,90,100']})
should produce the output below, because the elements 20 and 45 are common to both the list variable and the delimited text in the first row. Likewise, 35 intersects in the second row.
colA  colB
1     20,45,50,60
2     22,70,35
What I have tried is
test_lst = [20, 45, 35]
data["colC"]= data['colB'].str.split(',')
data
# data["colC"].apply(lambda x: set(x).intersection(test_lst))
print(data[data['colC'].apply(lambda x: set(x).intersection(test_lst)).astype(bool)])
data
This does not give the required result.
Any help is appreciated.
This might not be the best approach, but it works.
import pandas as pd

df = pd.DataFrame({'colA': [1, 2, 3],
                   'colB': ['20,45,50,60', '22,70,35', '10,90,100']})

def match_element(row):
    # Convert the delimited string into ints and check whether any of them
    # appear in test_lst.
    row_elements = [int(n) for n in row.split(',')]
    test_lst = [20, 45, 35]
    if [value for value in row_elements if value in test_lst]:
        return True
    else:
        return False

mask = df['colB'].apply(lambda row: match_element(row))
df = df[mask]
Output:
   colA         colB
0     1  20,45,50,60
1     2     22,70,35
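For larger frames, a more pandas-native sketch (assuming the same df and test_lst as above) splits the column once and tests the set intersection per row:
import pandas as pd

test_lst = {20, 45, 35}
df = pd.DataFrame({'colA': [1, 2, 3],
                   'colB': ['20,45,50,60', '22,70,35', '10,90,100']})

# Split each delimited string, convert the pieces to int, and keep rows
# that share at least one value with test_lst.
mask = df['colB'].str.split(',').apply(
    lambda parts: bool(test_lst.intersection(int(p) for p in parts)))
print(df[mask])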

Concatenate in place inside a sub-function with the pandas concat function?

I'm trying to write a function that takes a pandas DataFrame as an argument and at some point concatenates this DataFrame with another.
For example:
def concat(df):
    df = pd.concat((df, pd.DataFrame({'E': [1, 1, 1]})), axis=1)
I would like this function to modify the input df in place, but I can't find how to achieve this. When I do
...
print(df)
concat(df)
print(df)
The DataFrame df is identical before and after the function call.
Note: I don't want to do df['E'] = [1, 1, 1] because I don't know how many columns will be added to df. So I want to use pd.concat(), if possible...
This will edit the original DataFrame in place and give the desired output, as long as the new data contains the same number of rows as the original and there are no conflicting column names.
It's the same idea as your df['E'] = [1, 1, 1] suggestion, except it will work for an arbitrary number of columns.
I don't think there is a way to achieve this using pd.concat, as it doesn't have an inplace parameter as some Pandas functions do.
df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
df2 = pd.DataFrame({'C': [10, 20, 30], 'D': [40, 50, 60]})
df[df2.columns] = df2
Results (df):
A B C D
0 1 4 10 40
1 2 5 20 50
2 3 6 30 60
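Applied to the original question, a minimal sketch (the helper name concat_inplace and its new_cols argument are illustrative, not from the original post) wraps the same assignment so the caller's DataFrame is mutated:
import pandas as pd

def concat_inplace(df, new_cols):
    # Column assignment mutates the DataFrame object itself,
    # so the change is visible outside the function.
    df[new_cols.columns] = new_cols

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
concat_inplace(df, pd.DataFrame({'C': [10, 20, 30], 'D': [40, 50, 60]}))
print(df)  # now also contains columns C and D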

Pick first element of groupby object's group by index without converting to list

In the code below, I am iterating over the groups of a groupby object and printing the first item in column b of each group.
import pandas as pd

d = {
    'a': [1, 2, 3, 4, 5, 6],
    'b': [10, 20, 30, 10, 20, 30],
}
df = pd.DataFrame(d)
groups = df.groupby('b')
for name, group in groups:
    first_item_in_b = group['b'].tolist()[0]
    print(first_item_in_b)
Because each group keeps the original index labels, in order to pick the first element in b I need to convert b to a list first.
How can I avoid this overhead?
I cannot just remove tolist() like so:
first_item_in_b = group['b'][0]
because it will give KeyError.
You can use Index.get_loc to get the position of column b, which makes it possible to select with DataFrame.iat using positions only, or to select by the first index value and the column name with DataFrame.at.
It is also possible to select by position with Series.iat or Series.iloc after selecting column b by label:
for name, group in groups:
    # first value by positions from column names
    first_item_in_b = group.iat[0, group.columns.get_loc('b')]
    # first value by labels from index
    first_item_in_b = group.at[group.index[0], 'b']
    # fast selection of the first value
    first_item_in_b = group['b'].iat[0]
    # alternative
    first_item_in_b = group['b'].iloc[0]
    print(first_item_in_b)
10
20
30
Using iloc:
import pandas as pd

d = {
    'a': [1, 2, 3, 4, 5, 6],
    'b': [10, 20, 30, 10, 20, 30],
}
df = pd.DataFrame(d)
groups = df.groupby('b')
for name, group in groups:
    first_item_in_b = group['b'].iloc[0]
    print(first_item_in_b)
OUTPUT:
10
20
30
EDIT:
Or use Series.iat, the fast integer-location scalar accessor.
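If only the first b value of every group is needed (and no other per-group work inside the loop), a sketch using GroupBy.first avoids the Python-level loop entirely:
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3, 4, 5, 6],
                   'b': [10, 20, 30, 10, 20, 30]})

# One entry per group, holding the first value of 'b' within each group.
first_b = df.groupby('b')['b'].first()
print(first_b.tolist())  # [10, 20, 30]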

Pandas DataFrame to multidimensional NumPy Array

I have a Dataframe which I want to transform into a multidimensional array using one of the columns as the 3rd dimension.
As an example:
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'id': [1, 2, 2, 3, 3, 3],
    'date': np.random.randint(1, 6, 6),
    'value1': [11, 12, 13, 14, 15, 16],
    'value2': [21, 22, 23, 24, 25, 26]
})
I would like to transform it into a 3D array with dimensions (id, date, values) like this:
The problem is that the 'id's do not have the same number of occurrences so I cannot use np.reshape().
For this simplified example, I was able to use:
ra = np.full((3, 3, 3), np.nan)
for i, value in enumerate(df['id'].unique()):
    rows = df.loc[df['id'] == value].shape[0]
    ra[i, :rows, :] = df.loc[df['id'] == value, 'date':'value2']
To produce the needed result:
But the original DataFrame contains millions of rows.
Is there a vectorized way to accomplish the same result?
Approach #1
Here's one vectorized approach, after sorting the id column with df.sort_values('id', inplace=True) as suggested by @Yannis in the comments:
count_id = df.id.value_counts().sort_index().values   # rows per id, in id order
mask = count_id[:, None] > np.arange(count_id.max())  # valid (id, slot) positions
vals = df.loc[:, 'date':'value2'].values
out_shp = mask.shape + (vals.shape[1],)
out = np.full(out_shp, np.nan)
out[mask] = vals
Approach #2
Another approach, with factorize, that doesn't require any pre-sorting:
x = df.id.factorize()[0]               # group number of each row
y = df.groupby(x).cumcount().values    # position of each row within its group
vals = df.loc[:, 'date':'value2'].values
out_shp = (x.max()+1, y.max()+1, vals.shape[1])
out = np.full(out_shp, np.nan)
out[x, y] = vals
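As a quick sanity check (a sketch assuming the example df above and the out produced by either approach), the result has one 2-D slab per id, padded with NaN where an id has fewer rows than the largest group:
print(out.shape)  # (3, 3, 3) for the example above
print(out[0])     # the rows belonging to the first id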

How do I get the index of each item in a groupby object in Pandas?

I use groupby on a dataframe based on the columns I want and then I have to take the index of each item in its group. By index I mean, if there are 10 items in a group, the index goes from 0 to 9, not the dataframe index.
My code for doing this is below:
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': np.random.randint(0, 11, 10 ** 3), 'B': np.random.randint(0, 11, 10 ** 3),
                   'C': np.random.randint(0, 11, 10 ** 3), 'D': np.random.randint(0, 2, 10 ** 3)})
grouped_by = df.groupby(["A", "B", "C"])
groups = dict(list(grouped_by))
index_dict = {k: v.index.tolist() for k,v in groups.items()}
df["POS"] = df.apply(lambda x: index_dict[(x["A"], x["B"], x["C"])].index(x.name), axis=1)
The dataframe here is just an example.
Is there a way to use grouped_by to achieve this?
Here's a solution using cumcount() on a dummy variable to generate an item index for each group. It should be significantly faster too.
In [122]: df['dummy'] = 0
...: df["POS"] = df.groupby(['A','B','C'])['dummy'].cumcount()
...: df = df.drop('dummy', axis=1)
As @unutbu noted, it's even cleaner just to use:
df["POS"] = df.groupby(['A','B','C']).cumcount()
