Move one column to another dataframe pandas - python

I have a DataFrame df1 that looks like this:
userId movie1 movie2 movie3
0 4.1 0.0 1.0
1 3.1 1.1 3.4
2 2.8 0.0 1.7
3 0.0 5.0 0.0
4 0.0 0.0 0.0
5 2.3 0.0 2.0
and another DataFrame, df2 that looks like this:
userId movie4 movie5 movie6
0 4.1 0.0 1.0
1 3.1 1.1 3.4
2 2.8 0.0 1.7
3 0.0 5.0 0.0
4 0.0 0.0 0.0
5 2.3 0.0 2.0
How do I select one column from df2 and add it to df1? For example, adding movie6 to df1 would result:
userId movie1 movie2 movie3 movie6
0 4.1 0.0 1.0 1.0
1 3.1 1.1 3.4 3.4
2 2.8 0.0 1.7 1.7
3 0.0 5.0 0.0 0.0
4 0.0 0.0 0.0 0.0
5 2.3 0.0 2.0 2.0

You can concatenate the column along the column axis (note axis=1; axis=0 would stack the rows underneath instead):
df1 = pd.concat([df1, df2['movie6']], axis=1)
This aligns on the index, so it assumes both frames share the same row order.

You can merge on the shared column, userId:
df1 = df1.merge(df2[["userId","movie6"]], on="userId")
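Both approaches can be compared on a toy frame (the data below is made up for illustration): merge matches rows via the userId key, while pd.concat(..., axis=1) only lines rows up by index, so concat is safe solely when both frames share the same row order.

```python
import pandas as pd

# Made-up miniature versions of df1 and df2
df1 = pd.DataFrame({"userId": [0, 1, 2], "movie1": [4.1, 3.1, 2.8]})
df2 = pd.DataFrame({"userId": [0, 1, 2], "movie6": [1.0, 3.4, 1.7]})

# merge matches rows via the userId key -- robust even if row order differs
merged = df1.merge(df2[["userId", "movie6"]], on="userId")

# concat(axis=1) merely aligns on the index position
side_by_side = pd.concat([df1, df2["movie6"]], axis=1)
print(merged)
```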

Counting consonants and vowels in a split string

I read in a .csv file. I have the following data frame that counts vowels and consonants in a string in the column Description. This works great, but my problem is I want to split Description into 8 columns and count the consonants and vowels for each column. The second part of my code allows for me to split Description into 8 columns. How can I count the vowels and consonants on all 8 columns the Description is split into?
import pandas as pd
import re
def anti_vowel(s):
    return re.sub(r'[AEIOU]', '', s, flags=re.IGNORECASE)
data = pd.read_csv('http://core.secure.ehc.com/src/util/detail-price-list/TristarDivision_SummitMedicalCenter_CM.csv')
data.dropna(inplace = True)
data['Vowels'] = data['Description'].str.count(r'[aeiou]', flags=re.I)
data['Consonant'] = data['Description'].str.count(r'[bcdfghjklmnpqrstvwxzy]', flags=re.I)
print (data)
This is the code I'm using to split the column Description into 8 columns.
import pandas as pd
data = pd.read_csv('http://core.secure.ehc.com/src/util/detail-price-list/TristarDivision_SummitMedicalCenter_CM.csv')
data.dropna(inplace = True)
data = data["Description"].str.split(" ", n = 8, expand = True)
print (data)
Now how can I put it all together?
In order to read each of the 8 columns and count consonants, I know I can use the following, replacing the 0 with 0-7:
testconsonant = data[0].str.count(r'[bcdfghjklmnpqrstvwxzy]', flags=re.I)
testvowel = data[0].str.count(r'[aeiou]', flags=re.I)
Desired output would be:
Description[0] vowel count consonant count | Description[1] vowel count consonant count | Description[2] vowel count consonant count | ... all the way to Description[7]
stack, then unstack:
import re  # the counts use the re.I flag

stacked = data.stack()
pd.concat({
    'Vowels': stacked.str.count('[aeiou]', flags=re.I),
    'Consonant': stacked.str.count('[bcdfghjklmnpqrstvwxzy]', flags=re.I)
}, axis=1).unstack()
Consonant Vowels
0 1 2 3 4 5 6 7 8 0 1 2 3 4 5 6 7 8
0 3.0 5.0 5.0 1.0 2.0 NaN NaN NaN NaN 1.0 0.0 0.0 0.0 0.0 NaN NaN NaN NaN
1 8.0 5.0 1.0 0.0 0.0 0.0 0.0 0.0 NaN 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 NaN
2 8.0 5.0 1.0 0.0 0.0 0.0 0.0 0.0 NaN 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 NaN
3 8.0 5.0 1.0 0.0 0.0 0.0 0.0 0.0 NaN 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 NaN
4 3.0 5.0 3.0 1.0 0.0 0.0 0.0 0.0 NaN 0.0 0.0 2.0 0.0 0.0 0.0 0.0 0.0 NaN
5 3.0 5.0 3.0 1.0 0.0 0.0 0.0 0.0 NaN 0.0 0.0 2.0 0.0 0.0 0.0 0.0 0.0 NaN
6 3.0 4.0 0.0 1.0 0.0 0.0 0.0 NaN NaN 3.0 1.0 0.0 0.0 0.0 0.0 0.0 NaN NaN
7 3.0 3.0 0.0 1.0 0.0 0.0 0.0 NaN NaN 3.0 1.0 0.0 1.0 0.0 0.0 0.0 NaN NaN
8 3.0 3.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 3.0 1.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0
9 3.0 3.0 0.0 1.0 0.0 0.0 0.0 NaN NaN 3.0 1.0 0.0 1.0 0.0 0.0 0.0 NaN NaN
10 3.0 3.0 0.0 1.0 0.0 0.0 0.0 0.0 NaN 3.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 NaN
11 3.0 3.0 0.0 2.0 2.0 NaN NaN NaN NaN 3.0 0.0 0.0 0.0 0.0 NaN NaN NaN NaN
12 3.0 3.0 0.0 1.0 0.0 0.0 0.0 0.0 NaN 3.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 NaN
13 3.0 3.0 0.0 2.0 2.0 NaN NaN NaN NaN 3.0 1.0 0.0 0.0 0.0 NaN NaN NaN NaN
14 3.0 5.0 0.0 2.0 0.0 0.0 0.0 0.0 0.0 3.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
15 3.0 3.0 0.0 3.0 1.0 NaN NaN NaN NaN 3.0 0.0 0.0 0.0 1.0 NaN NaN NaN NaN
If you want to combine this with the data dataframe, you can do:
stacked = data.stack()
pd.concat({
    'Data': data,
    'Vowels': stacked.str.count('[aeiou]', flags=re.I),
    'Consonant': stacked.str.count('[bcdfghjklmnpqrstvwxzy]', flags=re.I)
}, axis=1).unstack()
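As a minimal sketch of the stack/count/unstack idea on made-up data (a two-column stand-in for the 8 split Description columns): stack collapses the frame into one long Series keyed by (row, column), the counts are computed once on that Series, and unstack fans them back out per column.

```python
import re
import pandas as pd

# Made-up stand-in for the split Description columns
data = pd.DataFrame({0: ["HIP", "KNEE"], 1: ["XRAY", None]})

stacked = data.stack()  # long Series with a (row, column) MultiIndex; NaNs dropped
counts = pd.concat({
    "Vowels": stacked.str.count(r"[aeiou]", flags=re.I),
    "Consonant": stacked.str.count(r"[bcdfghjklmnpqrstvwxzy]", flags=re.I),
}, axis=1).unstack()  # back to one row per original row, one column pair per split column
print(counts)
```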

Can't Re-Order Columns Data

I have a dataframe whose columns are not in sequential order; len(df.columns) shows my data has 3586 columns. How can I re-order the columns into numeric sequence?
ID V1 V10 V100 V1000 V1001 V1002 ... V990 V991 V992 V993 V994
A 1 9.0 2.9 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
B 1 1.2 0.1 3.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0
C 2 8.6 8.0 2.0 0.0 0.0 0.0 2.0 0.0 0.0 0.0 0.0
D 3 0.0 2.0 0.0 0.0 0.0 0.0 3.0 0.0 0.0 0.0 0.0
E 4 7.8 6.6 3.0 0.0 0.0 0.0 4.0 0.0 0.0 0.0 0.0
I used df = df.reindex(sorted(df.columns), axis=1) (based on this question: Re-ordering columns in pandas dataframe based on column name), but it sorts lexicographically (V1, V10, V100, ...), which is still not what I want.
thank you
First get all columns that don't match the pattern V + number by filtering with str.contains. Then sort the remaining columns, obtained with Index.difference, by their numeric suffix, join the two lists, and pass the result to DataFrame.reindex. This puts the non-matching columns in the first positions, followed by the V + number columns in numeric order:
L1 = df.columns[~df.columns.str.contains(r'^V\d+$')].tolist()
L2 = sorted(df.columns.difference(L1), key=lambda x: float(x[1:]))
df = df.reindex(L1 + L2, axis=1)
print (df)
ID V1 V10 V100 V990 V991 V992 V993 V994 V1000 V1001 V1002
A 1 9.0 2.9 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
B 1 1.2 0.1 3.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
C 2 8.6 8.0 2.0 2.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
D 3 0.0 2.0 0.0 3.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
E 4 7.8 6.6 3.0 4.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
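The key point is sorting by the numeric suffix rather than the string; a self-contained sketch on a made-up set of column names:

```python
import pandas as pd

# Made-up frame with out-of-order V-columns
df = pd.DataFrame(columns=["ID", "V1", "V10", "V100", "V2"])

# Columns that don't match "V<number>" go first
L1 = df.columns[~df.columns.str.contains(r"^V\d+$")].tolist()
# Sort the rest by the integer after the leading "V", not lexicographically
L2 = sorted(df.columns.difference(L1), key=lambda x: int(x[1:]))
df = df.reindex(L1 + L2, axis=1)
print(list(df.columns))
```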

How to reindex data frame in Pandas?

I'm using pandas in Python, and I have performed some crosstab calculations and concatenations, and at the end up with a data frame that looks like this:
ID 5 6 7 8 9 10 11 12 13
Total 87.0 3.0 9.0 6.0 92.0 7.0 3.0 3.0 20.0
Regular 72.0 2.0 8.0 5.0 81.0 7.0 3.0 3.0 18.0
CR 22.0 0.0 0.0 0.0 17.0 0.0 0.0 0.0 3.0
HDG 20.0 0.0 0.0 0.0 24.0 4.0 0.0 0.0 1.0
PPG 30.0 2.0 8.0 5.0 40.0 3.0 3.0 3.0 14.0
Superior 15.0 1.0 1.0 1.0 11.0 0.0 0.0 0.0 2.0
CR 3.0 0.0 0.0 0.0 2.0 0.0 0.0 0.0 0.0
HDG 5.0 1.0 1.0 1.0 4.0 0.0 0.0 0.0 0.0
PPG 7.0 0.0 0.0 0.0 5.0 0.0 0.0 0.0 2.0
The problem is that I want the last 4 rows, starting with Superior, to be placed directly after the Total row. So, simply, I want to swap the positions of the last 4 rows with the 4 rows that start with Regular. How can I achieve this in pandas, so that I get this:
ID 5 6 7 8 9 10 11 12 13
Total 87.0 3.0 9.0 6.0 92.0 7.0 3.0 3.0 20.0
Superior 15.0 1.0 1.0 1.0 11.0 0.0 0.0 0.0 2.0
CR 3.0 0.0 0.0 0.0 2.0 0.0 0.0 0.0 0.0
HDG 5.0 1.0 1.0 1.0 4.0 0.0 0.0 0.0 0.0
PPG 7.0 0.0 0.0 0.0 5.0 0.0 0.0 0.0 2.0
Regular 72.0 2.0 8.0 5.0 81.0 7.0 3.0 3.0 18.0
CR 22.0 0.0 0.0 0.0 17.0 0.0 0.0 0.0 3.0
HDG 20.0 0.0 0.0 0.0 24.0 4.0 0.0 0.0 1.0
PPG 30.0 2.0 8.0 5.0 40.0 3.0 3.0 3.0 14.0
A more generalized solution uses Categorical and argsort. Since this df is ordered (each group header precedes its sub-rows), ffill is safe here:
import numpy as np
import pandas as pd

s = df.ID
s = s.where(s.isin(['Total', 'Regular', 'Superior'])).ffill()
s = pd.Categorical(s, ['Total', 'Superior', 'Regular'], ordered=True)
df = df.iloc[np.argsort(s)]
df
Out[188]:
ID 5 6 7 8 9 10 11 12 13
0 Total 87.0 3.0 9.0 6.0 92.0 7.0 3.0 3.0 20.0
5 Superior 15.0 1.0 1.0 1.0 11.0 0.0 0.0 0.0 2.0
6 CR 3.0 0.0 0.0 0.0 2.0 0.0 0.0 0.0 0.0
7 HDG 5.0 1.0 1.0 1.0 4.0 0.0 0.0 0.0 0.0
8 PPG 7.0 0.0 0.0 0.0 5.0 0.0 0.0 0.0 2.0
1 Regular 72.0 2.0 8.0 5.0 81.0 7.0 3.0 3.0 18.0
2 CR 22.0 0.0 0.0 0.0 17.0 0.0 0.0 0.0 3.0
3 HDG 20.0 0.0 0.0 0.0 24.0 4.0 0.0 0.0 1.0
4 PPG 30.0 2.0 8.0 5.0 40.0 3.0 3.0 3.0 14.0
Here's one way:
import numpy as np
df.iloc[1:,:] = np.roll(df.iloc[1:,:].values, 4, axis=0)
ID 5 6 7 8 9 10 11 12 13
0 Total 87.0 3.0 9.0 6.0 92.0 7.0 3.0 3.0 20.0
1 Superior 15.0 1.0 1.0 1.0 11.0 0.0 0.0 0.0 2.0
2 CR 3.0 0.0 0.0 0.0 2.0 0.0 0.0 0.0 0.0
3 HDG 5.0 1.0 1.0 1.0 4.0 0.0 0.0 0.0 0.0
4 PPG 7.0 0.0 0.0 0.0 5.0 0.0 0.0 0.0 2.0
5 Regular 72.0 2.0 8.0 5.0 81.0 7.0 3.0 3.0 18.0
6 CR 22.0 0.0 0.0 0.0 17.0 0.0 0.0 0.0 3.0
7 HDG 20.0 0.0 0.0 0.0 24.0 4.0 0.0 0.0 1.0
8 PPG 30.0 2.0 8.0 5.0 40.0 3.0 3.0 3.0 14.0
For a specific answer to this question, just use iloc
df.iloc[[0,5,6,7,8,1,2,3,4],:]
For a more generalized solution,
m = (df.ID.eq('Superior') | df.ID.eq('Regular')).cumsum()
pd.concat([df[m==0], df[m==2], df[m==1]])
or
order = (2,1)
pd.concat([df[m==0], *[df[m==c] for c in order]])
where order defines the mapping from previous ordering to new ordering.
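A small made-up frame shows how the cumsum trick labels the blocks: each Regular/Superior header row bumps the counter, so every row is tagged with the block number of the header above it, and the blocks can then be concatenated in any order.

```python
import pandas as pd

# Made-up miniature of the grouped table
df = pd.DataFrame({"ID": ["Total", "Regular", "CR", "Superior", "HDG"],
                   "val": [10, 7, 2, 3, 1]})

# 0 = rows before any header, 1 = Regular block, 2 = Superior block
m = (df.ID.eq("Superior") | df.ID.eq("Regular")).cumsum()

# Reassemble with the Superior block ahead of the Regular block
out = pd.concat([df[m == 0], df[m == 2], df[m == 1]])
print(out.ID.tolist())
```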

pandas slice rows based on joint condition

Consider the below DataFrame, df:
one two three four five six seven eight
0 0.1 1.1 2.2 3.3 3.6 4.1 0.0 0.0
1 0.1 2.1 2.3 3.2 3.7 4.3 0.0 0.0
2 1.6 0.0 0.0 0.0 0.0 0.0 0.0 0.0
3 0.1 1.2 2.5 3.7 4.4 0.0 0.0 0.0
4 1.7 2.1 0.0 0.0 0.0 0.0 0.0 0.0
5 2.1 3.2 0.0 0.0 0.0 0.0 0.0 0.0
6 2.1 2.3 3.2 4.3 0.0 0.0 0.0 0.0
7 2.2 0.0 0.0 0.0 0.0 0.0 0.0 0.0
8 0.1 1.8 0.0 0.0 0.0 0.0 0.0 0.0
9 1.6 0.0 0.0 0.0 0.0 0.0 0.0 0.0
I want to select all rows where any column's value is 3.2, but at the same time the selected rows should not contain the values 0.1 or 1.2.
I am able to get the first part with the query below:
df[df.values == 3.2]
but I cannot combine this with the second part of the query (the joint != condition).
I also get the following warning:
DeprecationWarning: elementwise != comparison failed; this will raise an error in the future.
on the larger data set (but not on the smaller replica) when trying the below:
df[df.values != [0.1,1.2]]
Edit:
@pensen, here is the output; rows 1, 15, 27, 35 contain the value 0.1, though per the condition they should have been filtered out.
contains = df.eq(3.2).any(axis=1)
not_contains = ~df.isin([0.1,1.2]).any(axis=1)
print(df[contains & not_contains])
0 1 2 3 4 5 6 7
1 0.1 2.1 3.2 0.0 0.0 0.0 0.0 0.0
15 0.1 1.1 2.2 3.2 3.3 3.6 3.7 0.0
27 0.1 2.1 2.3 3.2 3.6 3.7 4.3 0.0
31 3.2 0.0 0.0 0.0 0.0 0.0 0.0 0.0
35 0.1 1.7 2.1 3.2 3.6 3.7 4.3 0.0
here is the original dataset from 0:36 rows to replicate the above output
0 1 2 3 4 5 6 7
0 4.1 0.0 0.0 0.0 0.0 0.0 0.0 0.0
1 0.1 2.1 3.2 0.0 0.0 0.0 0.0 0.0
2 0.1 2.4 2.5 0.0 0.0 0.0 0.0 0.0
3 2.4 0.0 0.0 0.0 0.0 0.0 0.0 0.0
4 4.4 0.0 0.0 0.0 0.0 0.0 0.0 0.0
5 1.1 0.0 0.0 0.0 0.0 0.0 0.0 0.0
6 0.1 2.1 4.1 0.0 0.0 0.0 0.0 0.0
7 4.4 0.0 0.0 0.0 0.0 0.0 0.0 0.0
8 1.7 0.0 0.0 0.0 0.0 0.0 0.0 0.0
9 2.2 0.0 0.0 0.0 0.0 0.0 0.0 0.0
10 1.1 0.0 0.0 0.0 0.0 0.0 0.0 0.0
11 1.1 4.1 0.0 0.0 0.0 0.0 0.0 0.0
12 0.1 2.2 3.3 3.6 0.0 0.0 0.0 0.0
13 0.1 1.8 3.3 0.0 0.0 0.0 0.0 0.0
14 0.1 1.2 1.3 2.5 3.7 4.2 0.0 0.0
15 0.1 1.1 2.2 3.2 3.3 3.6 3.7 0.0
16 1.3 0.0 0.0 0.0 0.0 0.0 0.0 0.0
17 1.3 0.0 0.0 0.0 0.0 0.0 0.0 0.0
18 1.3 2.5 0.0 0.0 0.0 0.0 0.0 0.0
19 0.1 1.2 2.5 3.7 4.4 0.0 0.0 0.0
20 1.2 4.4 0.0 0.0 0.0 0.0 0.0 0.0
21 4.3 0.0 0.0 0.0 0.0 0.0 0.0 0.0
22 1.1 0.0 0.0 0.0 0.0 0.0 0.0 0.0
23 0.1 2.2 2.4 2.5 3.7 0.0 0.0 0.0
24 0.1 2.4 4.3 0.0 0.0 0.0 0.0 0.0
25 1.7 0.0 0.0 0.0 0.0 0.0 0.0 0.0
26 0.1 1.1 4.1 0.0 0.0 0.0 0.0 0.0
27 0.1 2.1 2.3 3.2 3.6 3.7 4.3 0.0
28 1.4 2.2 3.6 4.1 0.0 0.0 0.0 0.0
29 1.8 0.0 0.0 0.0 0.0 0.0 0.0 0.0
30 1.2 4.4 0.0 0.0 0.0 0.0 0.0 0.0
31 3.2 0.0 0.0 0.0 0.0 0.0 0.0 0.0
32 3.6 4.1 0.0 0.0 0.0 0.0 0.0 0.0
33 2.1 2.4 0.0 0.0 0.0 0.0 0.0 0.0
34 0.1 1.8 0.0 0.0 0.0 0.0 0.0 0.0
35 0.1 1.7 2.1 3.2 3.6 3.7 4.3 0.0
here is the link to the actual dataset
You can do the following in short:
df.eq(3.2).any(axis=1) & ~df.isin([0.1, 1.2]).any(axis=1)
Or here more explicitly:
contains = df.eq(3.2).any(axis=1)
not_contains = ~df.isin([0.1,1.2]).any(axis=1)
print(df[contains & not_contains])
one two three four five six seven eight
5 2.1 3.2 0.0 0.0 0.0 0.0 0.0 0.0
6 2.1 2.3 3.2 4.3 0.0 0.0 0.0 0.0
For performance, especially since you mentioned a large dataset, and if you are looking to exclude just two numbers, here's one approach that works on the underlying array data:
a = df.values
df_out = df.iloc[(a == 3.2).any(1) & (((a!=0.1) & (a!=1.2)).all(1))]
Sample run -
In [43]: a = df.values
In [44]: df.iloc[(a == 3.2).any(1) & (((a!=0.1) & (a!=1.2)).all(1))]
Out[44]:
one two three four five six seven eight
5 2.1 3.2 0.0 0.0 0 0 0 0
6 2.1 2.3 3.2 4.3 0 0 0 0
You could just combine the conditions.
>>> df[(df == 3.2).any(1) & ~df.isin([0.1, 1.2]).any(1)]
one two three four five six seven eight
5 2.1 3.2 0.0 0.0 0.0 0.0 0.0 0.0
6 2.1 2.3 3.2 4.3 0.0 0.0 0.0 0.0
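The DeprecationWarning from df.values != [0.1, 1.2] comes from broadcasting: NumPy tries to compare each length-8 row against a length-2 list, and the shapes don't align. isin tests membership element-wise regardless of shape. A minimal sketch on made-up data:

```python
import pandas as pd

# Made-up miniature of the dataset
df = pd.DataFrame({"one": [2.1, 0.1], "two": [3.2, 3.2], "three": [0.0, 0.0]})

# df.values != [0.1, 1.2] would try to broadcast a length-2 list
# against length-3 rows and fail; isin avoids the shape problem.
mask = df.eq(3.2).any(axis=1) & ~df.isin([0.1, 1.2]).any(axis=1)
print(df[mask])  # only the row with 3.2 and without 0.1/1.2 survives
```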

How can I change a specific row label in a Pandas dataframe?

I have a dataframe such as:
0 1 2 3 4 5
0 41.0 22.0 9.0 4.0 2.0 1.0
1 6.0 1.0 2.0 1.0 1.0 1.0
2 4.0 2.0 4.0 1.0 0.0 1.0
3 1.0 2.0 1.0 1.0 1.0 1.0
4 5.0 1.0 0.0 1.0 0.0 1.0
5 11.4 5.6 3.2 1.6 0.8 1.0
Where the final row contains averages. I would like to rename the final row label to "A" so that the dataframe will look like this:
0 1 2 3 4 5
0 41.0 22.0 9.0 4.0 2.0 1.0
1 6.0 1.0 2.0 1.0 1.0 1.0
2 4.0 2.0 4.0 1.0 0.0 1.0
3 1.0 2.0 1.0 1.0 1.0 1.0
4 5.0 1.0 0.0 1.0 0.0 1.0
A 11.4 5.6 3.2 1.6 0.8 1.0
I understand columns can be renamed with df.columns = .... But how can I do this for a specific row label?
You can get the last index label using negative indexing, as with Python lists:
last = df.index[-1]
Then
df = df.rename(index={last: 'A'})
Edit: If you are looking for a one-liner,
df.index = df.index[:-1].tolist() + ['A']
Use the index attribute:
df.index = df.index[:-1].append(pd.Index(['A']))
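A compact sketch of the rename approach on a made-up frame; rename only touches the labels you mention and leaves the rest of the index intact:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3]})  # rows labelled 0, 1, 2

# Map only the last label to "A"; other labels are untouched
df = df.rename(index={df.index[-1]: "A"})
print(df.index.tolist())  # [0, 1, 'A']
```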
