How to assign numbers to values in a dataframe - Python

Let's say I have a dataframe:
A      B   C               D
john   I   agree           Average
ryan   II  agree           agree
rose   V   strongly agree  disagree
Shawn  VI  disagree        agree
What I want to do is assign numbers to the values in columns C and D, like this:
A      B   C  D
john   I   1  3
ryan   II  1  1
rose   V   2  0
Shawn  VI  0  1
I can use map for a single column, but if there is more than one column, how do I change the values to numbers without writing an individual map for every column? (I know I could use a for loop, but the problem is how I would apply it here.)
Does anyone know how to do this?
I tried to use a for loop:
def assignNumbers(df):
    for i in df:
        second = df[i].map({'Average': 3, 'Agree': 1, 'Disagree': 0, 'Strongly Agree': 2})
    return second

It can easily be done with scikit-learn's LabelEncoder:
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
df['C'] = le.fit_transform(df['C'])
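Note that LabelEncoder picks its own integer codes (in sorted order) and encodes one column at a time, so it won't reproduce the exact numbering above. As a minimal sketch of the multi-column case, assuming the mapping keys match the cell values exactly (including case, since the sample data mixes 'Average' with lowercase levels), you can apply one shared map to several columns at once:

mapping = {'Average': 3, 'agree': 1, 'disagree': 0, 'strongly agree': 2}

# map() turns anything missing from the mapping into NaN, which makes
# misspelled levels easy to spot; df.replace(mapping) would instead
# leave unmapped values untouched.
df[['C', 'D']] = df[['C', 'D']].apply(lambda col: col.map(mapping))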

Related

Count if with 2 conditions - Python

I'm having some trouble solving this, so I come here for your help.
I have a dataframe with many columns, and I want to count how many cells of a specific column meet a condition on another column. In Excel this would be COUNTIF, but I can't figure out exactly how to do it for my problem. Let me give you an example.
Names   Detail
John    B
John    B
John    S
Martin  S
Martin  B
Robert  S
In this df for example there are 3 "B" and 3 "S" in total.
How can I get how many "B" and "S" there are for each name in column A?
I'm trying to get a result dataframe like:
        B  S
John    2  1
Martin  1  1
Robert  0  1
I tried:
b_var = sum(1 for i in df['Names'] if i == 'John')
s_var = sum(1 for k in df['Detail'] if k == 'B')
and then a loop over the names? But I don't know how to apply both conditions at a time, or whether a groupby approach is better.
Thanks!!
df.pivot_table(index='Names', columns='Detail', aggfunc='size', fill_value=0)
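If pivot_table feels indirect, pd.crosstab builds the same contingency table directly; as a sketch (assuming the two columns from the example), missing combinations such as Robert/B are filled with 0 automatically:

counts = pd.crosstab(df['Names'], df['Detail'])
print(counts)
# Detail  B  S
# Names
# John    2  1
# Martin  1  1
# Robert  0  1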

Dataframe specific transposition optimisation

I would like to transpose a Pandas DataFrame from rows to columns, where the number of rows is dynamic. The transposed DataFrame must therefore have a dynamic number of columns as well.
I succeeded using the iterrows() and concat() methods, but I would like to optimize my code.
Please find my current code:
import pandas as pd

expected_results_transposed = pd.DataFrame()
for i, r in expected_results.iterrows():
    t = pd.Series([r.get('B')], name=r.get('A'))
    expected_results_transposed = pd.concat([expected_results_transposed, t], axis=1)
print("CURRENT CASE EXPECTED RESULTS TRANSPOSED:\n{0}\n".format(expected_results_transposed))
Please find an illustration of the expected result:
[picture of expected result]
Do you have any solution to optimize my code using standard Pandas DataFrame methods/options?
Thank you for your help :)
Use DataFrame.transpose + DataFrame.set_index:
new_df = df.set_index('A').T.reset_index(drop=True)
new_df.columns.name = None
Example:
df2 = pd.DataFrame({'A': 'Mike Ana Jon Jenny'.split(), 'B': [1, 2, 3, 4]})
print(df2)
       A  B
0   Mike  1
1    Ana  2
2    Jon  3
3  Jenny  4
new_df = df2.set_index('A').T.reset_index(drop=True)
new_df.columns.name = None
print(new_df)
   Mike  Ana  Jon  Jenny
0     1    2    3      4
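If row-wise iteration is ever unavoidable, a cheaper pattern, sketched from the question's own code, is to collect the pieces in a list and concatenate once at the end; calling concat inside the loop re-copies the growing frame on every iteration:

# Build all the one-element Series first, then concatenate in one go.
pieces = [pd.Series([r.get('B')], name=r.get('A'))
          for _, r in expected_results.iterrows()]
expected_results_transposed = pd.concat(pieces, axis=1)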

Calculating the mean value of items across several columns in pandas

I have a dataframe with values spread over several columns. I want to calculate the mean value of all items from specific columns.
All the solutions I looked up end up giving me either the separate means of each column or the mean of the means of the selected columns.
E.g. my Dataframe looks like this:
Name   a  b  c  d
Alice  1  2  3  4
Alice  2     4  2
Alice  3        2
Alice  1     5  2
Ben    3  3  1  3
Ben    4  1  2  3
Ben    1  2  2
And I want to see the mean of all the b & c values for the rows where Name is "Alice".
When I try:
df[df["Name"]=="Alice"][["b","c"]].mean()
The result is:
b 2.00
c 4.00
dtype: float64
In another post I found a suggestion to try a "double" mean, one for each axis, e.g.:
df[df["Name"]=="Alice"][["b","c"]].mean(axis=1).mean()
But the result was then:
3.00
which is the mean of the means of both columns.
I am expecting a way to calculate:
(2 + 3 + 4 + 5) / 4 = 3.50
Is there a way to do this in Python?
You can use NumPy's np.nanmean here; it will simply see your section of the dataframe as an array and, by default, calculate the mean over the entire section, ignoring NaNs:
>>> import numpy as np
>>> np.nanmean(df.loc[df['Name'] == 'Alice', ['b', 'c']])
3.5
Or if you want to group by name, you can first stack the dataframe, like:
>>> df[['Name', 'b', 'c']].set_index('Name').stack().reset_index().groupby('Name').agg('mean')
              0
Name
Alice  3.500000
Ben    1.833333
You can groupby to sum all values and count them, then divide to get the mean.
This way you get the result for all names at once.
g = df.groupby('Name')[['b', 'c']]
g.sum().sum(axis=1) / g.count().sum(axis=1)
Name
Alice    3.500000
Ben      1.833333
dtype: float64
PS: In your example, it looks like you have empty strings in some cells. That's not advisable, since your columns' dtypes will be set to object. Use NaNs instead, to take full advantage of vectorized operations.
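A minimal sketch of that cleanup, assuming the blanks are empty strings and the columns are named as in the example: coerce the value columns to numeric so the blanks become NaN.

# errors='coerce' turns anything non-numeric (such as '') into NaN.
df[['a', 'b', 'c', 'd']] = df[['a', 'b', 'c', 'd']].apply(pd.to_numeric, errors='coerce')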
Assuming all your columns are numeric and the empty cells are NaN: a simple set_index, stack, and a per-name mean.
df.set_index('Name')[['b', 'c']].stack().groupby(level=0).mean()
Out[117]:
Name
Alice    3.500000
Ben      1.833333
dtype: float64

How to get the number of times a piece of a word is inside a particular column in pandas?

I'll try to use a simple example to describe my problem.
I have a csv file with many columns. One of these columns' header is "names".
From this column "names" I only need the number of times the name "John" appears.
As an example, my column "names" is as follows:
names
John
John M
Mike John
Audrey
Andrew
For this case I would need a Python script using pandas to get the value 3, because the word 'John' is repeated three times.
This is the code I am using:
from_csv = pd.read_csv(r'csv.csv', usecols=['names'], index_col=0, header=0)
times = from_csv.query('names == "John"').names.count()
But it only returns 1, because there is only one row that contains exactly "John".
I have tried using:
times = from_csv.query('names == "*John*"').names.count()
but no success.
How can I get the 3 for this particular situation? Thanks!
Using str.contains:
df.names.str.contains('John').sum()
Out[246]: 3
Or using map with in:
sum(map(lambda x: 'John' in x, df.names))
Out[248]: 3
You can use pandas.Series.str.count to count the number of times a pattern is encountered in each row.
df.names.str.count('John').sum()
3
In this example, it matches OP's output. However, this would produce different results if John appeared more than once in one row. Suppose we had this df instead:
df
                 names
0            John John
1        John M John M
2  Mike John Mike John
3        Audrey Audrey
4        Andrew Andrew
Then my answer produces
df.names.str.count('John').sum()
6
While Wen's answer produces
df.names.str.contains('John').sum()
3
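One caveat that applies to both counting approaches: 'John' is matched as a substring, so a value like 'Johnny' would also be counted. If only the whole word should match, a word-boundary regex is a common fix (a sketch):

# \b is a regex word boundary, so 'Johnny' or 'Johnson' no longer match.
df.names.str.count(r'\bJohn\b').sum()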

Changing the value in a dataframe column depending on the value of two columns in a different dataframe

I have two dataframes made with Pandas in python:
df1
id  business   state  inBusiness
1   painter    AL     no
2   insurance  AL     no
3   lawyer     OH     no
4   dentist    NY     yes
...
df2
id  business  state
1   painter   NY
2   painter   AL
3   builder   TX
4   painter   AL
...
Basically, I want to set the 'inBusiness' value in df1 to 'yes' if an instance of the exact same business/location combo exists in df2.
So for example, if painter/AL exists in df2, than all instances of painter/AL in df1 have their 'inBusiness' value set to yes.
The best I can come up with right now is this:
for index, row in df2.iterrows():
df1[ (df1.business==str(row['business'])) & (df1.state==str(row['state']))]['inBusiness'] = 'Yes'
but the first dataframe can potentially have hundreds of thousands of rows to loop through for each row in the second dataframe so this method is not very reliable. Is there a nice one-liner I can use here that would also be quick?
You could use .merge(how='left', indicator=True) (indicator was added in pandas>=0.17, see docs) to identify matching rows as well as the source of the match (taking only the business/state columns from df2, deduplicated, so that df2's id column doesn't interfere with the match) to get something along these lines:
merged = df1.merge(df2[['business', 'state']].drop_duplicates(), how='left', indicator=True)  # merges on the shared columns, business and state
   id   business state inBusiness     _merge
0   1    painter    AL         no       both
1   2  insurance    AL         no  left_only
2   3     lawyer    OH         no  left_only
3   4    dentist    NY        yes  left_only
The _merge column indicates in which cases the (business, state) combination is available in both df1 and df2. Then you just need:
merged['inBusiness'] = merged._merge == 'both'
to get:
   id   business state  inBusiness     _merge
0   1    painter    AL        True       both
1   2  insurance    AL       False  left_only
2   3     lawyer    OH       False  left_only
3   4    dentist    NY       False  left_only
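If the column should keep the question's 'yes'/'no' strings rather than booleans, one sketch, using NumPy and the merged frame from above, is:

import numpy as np

# Map the indicator back to 'yes'/'no', then drop the helper column.
merged['inBusiness'] = np.where(merged['_merge'] == 'both', 'yes', 'no')
merged = merged.drop(columns='_merge')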
It's probably most efficient to create a map:
inBusiness = {(business, state): 'yes'
              for business, state in zip(df2['business'], df2['state'])}
df1['inBusiness'] = [inBusiness.get((business, state), "no")
                     for business, state in zip(df1['business'], df1['state'])]
df1
OUTPUTS:
   id   business state inBusiness
0   1    painter    AL        yes
1   2  insurance    AL         no
2   3     lawyer    OH         no
3   4    dentist    NY         no
Explanation Edit:
You were vague about "explaining further," so I'll give a high-level overview of everything.
zip is a built-in function that takes two iterables (like two lists, or two Series) and "zips" them together into tuples:
a = [1, 2, 3]
b = ['a', 'b', 'c']
for tup in zip(a, b):
    print(tup)
outputs:
(1, 'a')
(2, 'b')
(3, 'c')
Additionally, tuples in Python can be "unpacked" into individual variables:
tup = (3,4)
x,y = tup
print(x)
print(y)
You can combine these two things to create dictionary comprehensions
newDict = {k: v for k,v in zip(a,b)}
newDict
Outputs:
{1: 'a', 2: 'b', 3: 'c'}
inBusiness is a Python dictionary created with a dictionary comprehension after zipping together the Series df2['business'] and df2['state'].
I did not actually need to unpack the variables, but I did so for what I thought would be clarity.
Note that this map is only half of what you're hoping to do, because every (business, state) key in the dictionary maps to 'yes'. Thankfully, dict.get lets us specify a default value to return if the key is not found, which in your case is "no".
Then, the desired column is created using a list-comprehension to achieve your desired result.
Does that cover everything?
