How can we fill the empty values in the column? - python

I have table A with 3 columns. The column val has some empty values. The question is: is there any way to fill the empty values based on the previous value using Python? For example, Alex and John take value 20 and Sam takes value 100.
A = [ id name val
1 jack 10
2 mec 20
3 alex
4 john
5 sam 250
6 tom 100
7 sam
8 hellen 300]

You can read your data in as a pandas DataFrame and use the built-in method fillna() to solve your problem. For example:
df = # your data
df.fillna(method='pad')
This would return a DataFrame like:
id name val
0 1 jack 10
1 2 mec 20
2 3 alex 20
3 4 john 20
4 5 sam 250
5 6 tom 100
6 7 sam 100
7 8 hellen 300
You can refer to the pandas documentation on fillna for more information.
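For completeness, here is a runnable sketch of the whole thing. The blank cells are read in as missing values; note that in recent pandas versions fillna(method='pad') is deprecated in favour of the equivalent df.ffill():
import pandas as pd

# reconstruct table A from the question; the blank vals become NaN
df = pd.DataFrame({
    'id': [1, 2, 3, 4, 5, 6, 7, 8],
    'name': ['jack', 'mec', 'alex', 'john', 'sam', 'tom', 'sam', 'hellen'],
    'val': [10, 20, None, None, 250, 100, None, 300],
})

# forward-fill: each missing val takes the previous non-missing value
print(df.ffill())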

Related

Add or Subtract two columns in a dataframe on basis of column?

I have a df with three columns: name, amount and type.
I'm trying to add or subtract values per user on the basis of type.
Here's my sample df
name amount type
0 John 10 ADD
1 John 20 ADD
2 John 50 ADD
3 John 50 SUBTRACT
4 Adam 15 ADD
5 Adam 25 ADD
6 Adam 5 ADD
7 Adam 30 SUBTRACT
8 Mary 100 ADD
My resultant df
name amount
0 John 30
1 Adam 15
2 Mary 100
The idea is to multiply by 1 if ADD and by -1 if SUBTRACT, and then aggregate with sum:
df1 = (df['amount'].mul(df['type'].map({'ADD':1, 'SUBTRACT':-1}))
.groupby(df['name'], sort=False)
.sum()
.reset_index(name='amount'))
print (df1)
name amount
0 John 30
1 Adam 15
2 Mary 100
Detail:
print (df['type'].map({'ADD':1, 'SUBTRACT':-1}))
0 1
1 1
2 1
3 -1
4 1
5 1
6 1
7 -1
8 1
Name: type, dtype: int64
It is also possible to single out only the SUBTRACT rows with numpy.where, multiplying those by -1 and everything else by 1 (with numpy imported as np):
df1 = (df['amount'].mul(np.where(df['type'].eq('SUBTRACT'), -1, 1))
.groupby(df['name'], sort=False)
.sum()
.reset_index(name='amount'))
print (df1)
name amount
0 John 30
1 Adam 15
2 Mary 100
One idea could be to use Series.where to change the sign of amount accordingly and then groupby.sum:
df.amount.where(df.type.eq('ADD'), -df.amount).groupby(df.name).sum().reset_index()
name amount
0 Adam 15
1 John 30
2 Mary 100
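To try any of the answers above, here is a minimal reconstruction of the sample frame (a sketch built from the question's data):
import pandas as pd
import numpy as np  # only needed for the np.where variant

df = pd.DataFrame({
    'name': ['John', 'John', 'John', 'John', 'Adam', 'Adam', 'Adam', 'Adam', 'Mary'],
    'amount': [10, 20, 50, 50, 15, 25, 5, 30, 100],
    'type': ['ADD', 'ADD', 'ADD', 'SUBTRACT', 'ADD', 'ADD', 'ADD', 'SUBTRACT', 'ADD'],
})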

Disproportionate stratified sampling in Pandas

How can I randomly select one row from each group (column Name) in the following dataframe:
Distance Name Time Order
1 16 John 5 0
4 31 John 9 1
0 23 Kate 3 0
3 15 Kate 7 1
2 32 Peter 2 0
5 26 Peter 4 1
Expected result:
Distance Name Time Order
4 31 John 9 1
0 23 Kate 3 0
2 32 Peter 2 0
You can use a groupby on the Name column and apply sample:
df.groupby('Name',as_index=False).apply(lambda x:x.sample()).reset_index(drop=True)
Distance Name Time Order
0 31 John 9 1
1 15 Kate 7 1
2 32 Peter 2 0
You can shuffle all rows using, for example, the numpy function random.permutation, then group by Name and take the first N rows from each group:
df.iloc[np.random.permutation(len(df))].groupby('Name').head(1)
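Both of the above draw a different row on each run; if you need reproducible output, sample accepts a random_state seed (a sketch; the seed value 42 is arbitrary):
# one reproducible random row per Name
df.groupby('Name', as_index=False).apply(lambda x: x.sample(random_state=42)).reset_index(drop=True)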
You can achieve that without a groupby; unique shows the distinct names you will end up with, one row each:
df['Name'].unique()
Shuffle the dataframe:
df.sample(frac=1)
And then drop duplicated rows:
df.drop_duplicates(subset=['Name'])
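The two steps combine into one expression (a sketch):
# shuffle first, so the kept "first" row per Name is a random one
df.sample(frac=1).drop_duplicates(subset=['Name'])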
df.drop_duplicates(subset='Name')
Distance Name Time Order
1 16 John 5 0
0 23 Kate 3 0
2 32 Peter 2 0
This should help, but it is not a random choice: it keeps the first row for each name.
How about using random? Like this: import your provided data,
import random
import pandas as pd

df = pd.read_csv('random_data.csv', header=0)
which looks like this,
Distance Name Time Order
1 16 John 5 0
4 3 John 9 1
0 23 Kate 3 0
3 15 Kate 7 1
then get a random column name,
colname = df.columns[random.randint(1, 3)]
and in this case it selected 'Name',
print(df[colname])
1 John
4 John
0 Kate
3 Kate
Name: Name, dtype: object
Of course I could have condensed this to,
print(df[df.columns[random.randint(1, 3)]])

Split a Pandas DataFrame where one factor column is evenly distributed among the splits

I'm trying to split a Pandas DataFrame into multiple separate DataFrames where one of the columns is evenly distributed among the resulting DataFrames. For example, I want the following DataFrame split into 2 distinct DataFrames where each one contains two records of each sector (selected at random).
So a df that looks like this:
id Name Sector
1 John A
2 Steven A
3 Jane A
4 Kyle A
5 Ashley B
6 Ken B
7 Tom B
8 Peter B
9 Elaine C
10 Tom C
11 Adam C
12 Simon C
13 Stephanie D
14 Jan D
15 Marsha D
16 David D
17 Drew E
18 Kit E
19 Corey E
20 James E
This would yield two DataFrames; one of them could look like this, while the other consists of the remaining records.
id Name Sector
1 John A
2 Steven A
7 Tom B
8 Peter B
10 Tom C
11 Adam C
13 Stephanie D
16 David D
19 Corey E
20 James E
I know np.array_split(df, 2) will get me part way there, but it may not evenly distribute the sectors like I need.
(Edited for clarity)
Update per comments and updated question:
df_1=df.groupby('Sector', as_index=False, group_keys=False).apply(lambda x: x.sample(n=2))
df_2 = df[~df.index.isin(df_1.index)]
print(df_1)
id Name Sector
2 3 Jane A
3 4 Kyle A
7 8 Peter B
5 6 Ken B
11 12 Simon C
9 10 Tom C
12 13 Stephanie D
15 16 David D
19 20 James E
17 18 Kit E
print(df_2)
id Name Sector
0 1 John A
1 2 Steven A
4 5 Ashley B
6 7 Tom B
8 9 Elaine C
10 11 Adam C
13 14 Jan D
14 15 Marsha D
16 17 Drew E
18 19 Corey E
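If you ever need more than two splits, one way (not from the answers here; split_evenly is a hypothetical helper) is to shuffle, number the rows within each sector, and bucket by that number modulo k. Each sector spreads evenly across the parts as long as its size divides by k:
import pandas as pd

def split_evenly(df, k, col='Sector'):
    # shuffle the whole frame, then number rows 0..n-1 within each sector;
    # row i of a sector goes to part i % k
    shuffled = df.sample(frac=1)
    part = shuffled.groupby(col).cumcount() % k
    return [shuffled[part == i].sort_index() for i in range(k)]

df_1, df_2 = split_evenly(df, 2)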
Here is a "funky" method, using sequential numbering and random sampling:
df['grp'] = df.groupby('Sector')['Sector']\
.transform(lambda x: x.notna().cumsum().sample(frac=1))
dd = dict(tuple(df.groupby('grp')))
Output:
dd[1]
id Name Sector grp
0 1 John A 1
4 5 Ken B 1
6 7 Elaine C 1
dd[2]
id Name Sector grp
2 3 Jane A 2
5 6 Tom B 2
7 8 Tom C 2
dd[3]
id Name Sector grp
1 2 Steven A 3
3 4 Ashley B 3
8 9 Adam C 3
Details:
Create a sequence of numbers in each sector group starting from 1, then randomize that number within the group to create a grouping key, grp.
Use grp to groupby, then create a dictionary with keys for each grp.
Here's my way: you can group by Sector and randomly select from each group with a loop, using the sample function:
for x, i in df.groupby('Sector'):
    print(i.sample())
If you need multiple random selections, pass sample the number of items you want. For example:
for x, i in df.groupby('Sector'):
    print(i.sample(2))
will return 2 random values from each group.
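If you want the sampled rows gathered into a single DataFrame rather than printed, you can collect the pieces with pd.concat (a sketch):
import pandas as pd

# one random row per Sector, collected into one frame
sampled = pd.concat(g.sample(1) for _, g in df.groupby('Sector'))
print(sampled)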

Creating dataframe from another dataframe and list

I have a list of names like this:
names = ['Josh', 'Jon', 'Adam', 'Barsa', 'Fekse', 'Bravo', 'Talyo', 'Zidane']
and I have a dataframe like this:
Number Names
0 1 Josh
1 2 Jon
2 3 Adam
3 4 Barsa
4 5 Fekse
5 6 Barsa
6 7 Barsa
7 8 Talyo
8 9 Jon
9 10 Zidane
I want to create a dataframe that has all the names in the names list and the corresponding numbers from this dataframe grouped together; for the names that do not have a corresponding number there should be an asterisk, like below:
Names Number
Josh 1
Jon 2,9
Adam 3
Barsa 4,6,7
Fekse 5
Bravo *
Talyo 8
Zidane 10
Do we have any built-in functions to get this done?
You can use GroupBy with str.join, then reindex with your names list:
res = df.groupby('Names')['Number'].apply(lambda x: ','.join(map(str, x))).to_frame()\
.reindex(names).fillna('*').reset_index()
print(res)
Names Number
0 Josh 1
1 Jon 2,9
2 Adam 3
3 Barsa 4,6,7
4 Fekse 5
5 Bravo *
6 Talyo 8
7 Zidane 10
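To reproduce this end to end, here is a minimal reconstruction of the inputs from the question (a sketch):
import pandas as pd

names = ['Josh', 'Jon', 'Adam', 'Barsa', 'Fekse', 'Bravo', 'Talyo', 'Zidane']
df = pd.DataFrame({
    'Number': range(1, 11),
    'Names': ['Josh', 'Jon', 'Adam', 'Barsa', 'Fekse',
              'Barsa', 'Barsa', 'Talyo', 'Jon', 'Zidane'],
})

res = df.groupby('Names')['Number'].apply(lambda x: ','.join(map(str, x))).to_frame()\
    .reindex(names).fillna('*').reset_index()
print(res)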

Iterating and averaging pandas data frame

I have a database with a lot of rows such as:
timestamp name price profit
bob 5 4
jim 3 2
jim 2 6
bob 6 7
jim 4 1
jim 6 3
bob 3 1
The database is sorted by a timestamp. I would like to be able to add a new column that takes the last 2 values in the price column before the current value and averages them. So the first three rows would look something like this with a new column:
timestamp name price profit new column
bob 5 4 4.5
jim 3 2 3
jim 2 6 5
(6+3)/2 = 4.5
(2+4)/2 = 3
(4+6)/2 = 5
This isn't for a school project or anything; it's just something I'm working on in my own time. I've tried asking a similar question before, but I don't think I was very clear. Thanks in advance!
def shift_n_roll(df):
    return df.shift(-1).rolling(2).mean().shift(-1)

df['new column'] = df.groupby('name').price.apply(shift_n_roll)
df
Looking at the result you want, I'm guessing you want the average of the two prices following the current one, rather than the "2 values in the price column before the current value".
I made up the timestamp values that you omitted, to be clear.
print(df)
timestamp name price profit
0 2016-01-01 bob 5 4
1 2016-01-02 jim 3 2
2 2016-01-03 jim 2 6
3 2016-01-04 bob 6 7
4 2016-01-05 jim 4 1
5 2016-01-06 jim 6 3
6 2016-01-07 bob 3 1
#No need to sort if you already did.
#df.sort_values(['name','timestamp'], inplace=True)
df['new column'] = (df.groupby('name')['price'].shift(-1) + df.groupby('name')['price'].shift(-2)) / 2
print(df.dropna())
timestamp name price profit new column
0 2016-01-01 bob 5 4 4.5
1 2016-01-02 jim 3 2 3.0
2 2016-01-03 jim 2 6 5.0
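For reference, here is a runnable reconstruction of the whole example, using the made-up timestamps from above:
import pandas as pd

df = pd.DataFrame({
    'timestamp': pd.date_range('2016-01-01', periods=7),
    'name': ['bob', 'jim', 'jim', 'bob', 'jim', 'jim', 'bob'],
    'price': [5, 3, 2, 6, 4, 6, 3],
    'profit': [4, 2, 6, 7, 1, 3, 1],
})

# average of the next two prices within each name
df['new column'] = (df.groupby('name')['price'].shift(-1)
                    + df.groupby('name')['price'].shift(-2)) / 2
print(df.dropna())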
