Adding a pandas series onto the end of each pandas dataframe row - python

I've had trouble finding a concise way to append a series to each row of a dataframe, with the series labels becoming new columns in the df. The values will be the same in every row of the dataframe, which is desired.
I can get the effect by doing the following:
df["new_col_A"] = ser["new_col_A"]
.....
df["new_col_Z"] = ser["new_col_Z"]
But this is so tedious there must be a better way, right?

Given:
# df
   A  B
0  1  2
1  1  3
2  4  6
# ser
C    a
D    b
dtype: object
Doing:
df[ser.index] = ser
print(df)
Output:
   A  B  C  D
0  1  2  a  b
1  1  3  a  b
2  4  6  a  b
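As a side note, DataFrame.assign can do the same thing, because unpacking the Series turns each label into a keyword argument that is broadcast down a new column (a minimal sketch, assuming the labels are strings):
out = df.assign(**ser)  # returns a new DataFrame with columns C and D added to every row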

Related

Rename name in Python Pandas MultiIndex

I am trying to rename a level of a Pandas MultiIndex, but it doesn't work. Here you can see my series object. By the way, why does the dataframe df_injury_record become a series object in this function?
Frequency_BodyPart = df_injury_record.groupby(["Surface","BodyPart"]).size()
In the next line you can see my attempt to rename the level.
Frequency_BodyPart.rename_axis(index={'Surface': 'Class'})
But after this, the level still has the same name.
Regards
One possible problem is a pandas version below 0.24, or that you forgot to assign the result back, as mentioned by @anky_91:
import pandas as pd

df_injury_record = pd.DataFrame({'Surface': list('aaaabbbbddd'),
                                 'BodyPart': list('abbbbdaaadd')})
Frequency_BodyPart = df_injury_record.groupby(["Surface","BodyPart"]).size()
print (Frequency_BodyPart)
Surface  BodyPart
a        a           1
         b           3
b        a           2
         b           1
         d           1
d        a           1
         d           2
dtype: int64
Frequency_BodyPart = Frequency_BodyPart.rename_axis(index={'Surface': 'Class'})
print (Frequency_BodyPart)
Class  BodyPart
a      a           1
       b           3
b      a           2
       b           1
       d           1
d      a           1
       d           2
dtype: int64
If you want a 3-column DataFrame that also works for older pandas versions:
df = Frequency_BodyPart.reset_index(name='count').rename(columns={'Surface': 'Class'})
print (df)
  Class BodyPart  count
0     a        a      1
1     a        b      3
2     b        a      2
3     b        b      1
4     b        d      1
5     d        a      1
6     d        d      2
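If rename_axis with a dict isn't available in your pandas version, one possible alternative is to rename the level directly with Index.set_names (a minimal sketch):
Frequency_BodyPart.index = Frequency_BodyPart.index.set_names('Class', level=0)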

How to subset one row in dask.dataframe?

I am trying to select only one row from a dask.dataframe with the command x.loc[0].compute(). It returns 4 rows, all having index=0. I tried reset_index, but there are still 4 rows with index=0 after resetting. (I think I did the reset correctly, because I used reset_index(drop=False) and I can see the original index in the new column.)
I read the dask.dataframe documentation, and it says something along the lines that there may be more than one row with index=0 due to how dask structures the chunked data.
So, if I really want only one row by using index=0 for subsetting, how can I do this?
Edit
Your problem probably comes from reset_index; that issue is explained at the end of this answer. The earlier part of the text shows how to work around it.
For example, there is the following dask DataFrame:
import pandas as pd
import dask
import dask.dataframe as dd

df = pd.DataFrame({'col_1': [1, 2, 3, 4, 5, 6, 7], 'col_2': list('abcdefg')},
                  index=pd.Index([0, 0, 1, 2, 3, 4, 5]))
df = dd.from_pandas(df, npartitions=2)
df.compute()
Out[1]:
   col_1 col_2
0      1     a
0      2     b
1      3     c
2      4     d
3      5     e
4      6     f
5      7     g
It has a numerical index with repeated 0 values. As loc is a
Purely label-location based indexer for selection by label
it selects both rows labeled 0 if you do:
df.loc[0].compute()
Out[]:
   col_1 col_2
0      1     a
0      2     b
- you'll get all the rows labeled 0 (or whichever label you specify).
In pandas there is pd.DataFrame.iloc, which helps us select a row by its numerical position. Unfortunately, in dask you can't do so, because iloc is
Purely integer-location based indexing for selection by position.
Only indexing the column positions is supported. Trying to select row positions will raise a ValueError.
To work around this problem, you can do some indexing tricks:
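(The code for this step appears to have been lost from the post; a sketch that reproduces the state shown below, by numbering the rows globally with a cumulative sum, might look like this:)
df = df.reset_index()            # the original index is kept as a column named 'index'
df['x'] = 1
df['x'] = df['x'].cumsum() - 1   # global 0-based row position across all partitions
df = df.set_index('x', sorted=True)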
df.compute()
Out[2]:
   index  col_1 col_2
x
0      0      1     a
1      0      2     b
2      1      3     c
3      2      4     d
4      3      5     e
5      4      6     f
6      5      7     g
- now there is a new index ranging from 0 to the length of the data frame minus 1.
It's possible to slice it with loc and do the following (I suppose that selecting the 0 label via loc now means "select the first row"):
df.loc[0].compute()
Out[3]:
   index  col_1 col_2
x
0      0      1     a
About the duplicated 0 index label
If you need the original index, it's still there, and it can be accessed through:
df.loc[:, 'index'].compute()
Out[4]:
x
0    0
1    0
2    1
3    2
4    3
5    4
6    5
I guess you get such duplication from reset_index() or similar, because it generates a new 0-based index for each partition; for example, for this table with 2 partitions:
df.reset_index().compute()
Out[5]:
   index  col_1 col_2
0      0      1     a
1      0      2     b
2      1      3     c
3      2      4     d
0      3      5     e
1      4      6     f
2      5      7     g
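As a side note, if the goal is simply "give me the first row", dask's head avoids the index gymnastics entirely (a sketch; head returns a plain pandas DataFrame):
first_row = x.head(1)  # first row of the first partition, no compute() needed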

Duplicate row of low occurrence in pandas dataframe

In the following dataset, what's the best way to duplicate rows whose groupby(['Type']) count is less than 3, up to a count of 3? df is the input and df1 is my desired outcome; you can see that row 3 from df was duplicated 2 times at the end. This is only an example; the real data has approximately 20 million lines and 400K unique Types, so a method that does this efficiently is desired.
>>> df
  Type  Val
0    a    1
1    a    2
2    a    3
3    b    1
4    c    3
5    c    2
6    c    1
>>> df1
  Type  Val
0    a    1
1    a    2
2    a    3
3    b    1
4    c    3
5    c    2
6    c    1
7    b    1
8    b    1
I thought about using something like the following, but I don't know the best way to write the func.
df.groupby('Type').apply(func)
Thank you in advance.
Use value_counts with map and repeat:
counts = df.Type.value_counts()
repeat_map = 3 - counts[counts < 3]
df['repeat_num'] = df.Type.map(repeat_map).fillna(0, downcast='infer')
df = df.append(df.set_index('Type')['Val'].repeat(df['repeat_num']).reset_index(),
               sort=False, ignore_index=True)[['Type', 'Val']]
print(df)
  Type  Val
0    a    1
1    a    2
2    a    3
3    b    1
4    c    3
5    c    2
6    c    1
7    b    1
8    b    1
Note: sort=False for append is available in pandas>=0.23.0; remove it if using a lower version.
EDIT: If the data contains multiple value columns, set all columns as the index except the one to repeat, then repeat and reset_index:
df = df.append(df.set_index(['Type', 'Val_1', 'Val_2'])['Val'].repeat(df['repeat_num']).reset_index(),
               sort=False, ignore_index=True)
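Note that DataFrame.append was removed in pandas 2.0; a sketch of the same idea with pd.concat, assuming the repeat_num column built above:
extra = df.set_index('Type')['Val'].repeat(df['repeat_num']).reset_index()
df = pd.concat([df, extra], ignore_index=True)[['Type', 'Val']]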

How do you concatenate two single rows in pandas?

I am trying to select a bunch of single rows from a bunch of dataframes and make a new dataframe by concatenating them together.
Here is a simple example
x = pd.DataFrame([[1, 2, 3], [1, 2, 3]], columns=["A", "B", "C"])
   A  B  C
0  1  2  3
1  1  2  3
a = x.loc[0, :]
A    1
B    2
C    3
Name: 0, dtype: int64
b = x.loc[1, :]
A    1
B    2
C    3
Name: 1, dtype: int64
c = pd.concat([a, b])
I end up with this:
A    1
B    2
C    3
A    1
B    2
C    3
Name: 0, dtype: int64
Whereas I would expect the original data frame:
   A  B  C
0  1  2  3
1  1  2  3
I can get the values and create a new dataframe, but this doesn't seem like the way to do it.
If you want to stack two series as rows (vertical stacking), one option is to concat along the columns and transpose.
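For example, a minimal sketch of the concat-and-transpose option:
pd.concat([a, b], axis=1).T
   A  B  C
0  1  2  3
1  1  2  3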
Another is using np.vstack:
import numpy as np

pd.DataFrame(np.vstack([a, b]), columns=a.index)
   A  B  C
0  1  2  3
1  1  2  3
Since you are slicing by index, I'd use .iloc, and then notice the difference between [[]] and [], which return a DataFrame and a Series, respectively.*
a = x.iloc[[0]]
b = x.iloc[[1]]
pd.concat([a, b])
#    A  B  C
# 0  1  2  3
# 1  1  2  3
To still use .loc, you'd do something like
a = x.loc[[0,]]
b = x.loc[[1,]]
*There's a small caveat: if index 0 is duplicated in x, then x.loc[0,:] will return a DataFrame and not a Series.
It looks like you want to make a new dataframe from a collection of records. There's a method for that:
import pandas as pd
x = pd.DataFrame([[1, 2, 3], [1, 2, 3]], columns=["A", "B", "C"])
a = x.loc[0,:]
b = x.loc[1,:]
c = pd.DataFrame.from_records([a, b])
print(c)
#    A  B  C
# 0  1  2  3
# 1  1  2  3
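A related option, in case it's useful: passing the list of Series straight to the DataFrame constructor also stacks them as rows (a sketch):
c = pd.DataFrame([a, b])  # each Series becomes one row; the Series names (0, 1) become the index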

Filling a column based on comparison of 2 other columns (pandas)

I am trying to do the following in pandas:
I have 2 DataFrames, both of which have a number of columns.
DataFrame 1 has a column A, that is of interest for my task;
DataFrame 2 has columns B and C, that are of interest.
What needs to be done: go through the values in column A and see if the same value exists somewhere in column B. If it does, create a column D in DataFrame 1 and fill its respective cell with the value from C that is on the same row as the matching value in B.
If the value from A does not exist in B, then fill the cell in D with a zero.
for i in range(len(df1)):
    if df1['A'].iloc[i] in df2.B.values:
        df1['D'].iloc[i] = df2['C'].iloc[i]
    else:
        df1['D'].iloc[i] = 0
This gives me an error: KeyError: 'D'. If I create column D in advance and fill it, for example, with 0's, then I get the following warning: A value is trying to be set on a copy of a slice from a DataFrame. How can I solve this? Or is there a better way to accomplish what I'm trying to do?
Thank you so much for your help!
If I understand correctly:
Given these 2 dataframes:
import pandas as pd
import numpy as np

np.random.seed(42)
df1 = pd.DataFrame({'A': np.random.choice(list('abce'), 10)})
df2 = pd.DataFrame({'B': list('abcd'), 'C': np.random.randn(4)})
>>> df1
   A
0  c
1  e
2  a
3  c
4  c
5  e
6  a
7  a
8  c
9  b
>>> df2
   B         C
0  a  0.279041
1  b  1.010515
2  c -0.580878
3  d -0.525170
You can achieve what you want using a merge:
new_df = df1.merge(df2, left_on='A', right_on='B', how='left').fillna(0)[['A','C']]
And then just rename the columns:
new_df.columns = ['A', 'D']
>>> new_df
   A         D
0  c -0.580878
1  e  0.000000
2  a  0.279041
3  c -0.580878
4  c -0.580878
5  e  0.000000
6  a  0.279041
7  a  0.279041
8  c -0.580878
9  b  1.010515
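A possible alternative that skips the merge entirely is Series.map with a B-to-C lookup (a minimal sketch, equivalent on this data):
df1['D'] = df1['A'].map(df2.set_index('B')['C']).fillna(0)  # unmatched A values become 0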
