I'm having some trouble solving this, so I come here for your help.
I have a dataframe with many columns, and I want to count how many cells of one column meet a condition on another column. In Excel this would be COUNTIF, but I can't figure out the equivalent for my problem. Let me give you an example.
Names Detail
John B
John B
John S
Martin S
Martin B
Robert S
In this df for example there are 3 "B" and 3 "S" in total.
How can I get how many "B" and "S" there are for each name in column A?
I'm trying to get a result dataframe like:
        B  S
John    2  1
Martin  1  1
Robert  0  1
I tried
b_var = sum(1 for i in df['Names'] if i == 'John')
s_var = sum(1 for k in df['Detail'] if k == 'B')
and then maybe loop over the names, but I don't know how to apply both conditions at the same time. Or would a groupby approach be better?
Thanks!!
df.pivot_table(index='Names', columns='Detail', aggfunc=len)
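As an alternative sketch, pd.crosstab counts how often each Names/Detail pair occurs and fills missing combinations with 0. The dataframe below is rebuilt from the example in the question; the variable name counts is only illustrative.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    'Names': ['John', 'John', 'John', 'Martin', 'Martin', 'Robert'],
    'Detail': ['B', 'B', 'S', 'S', 'B', 'S'],
})

# Rows are the unique names, columns are the unique details,
# and each cell is the number of rows with that combination.
counts = pd.crosstab(df['Names'], df['Detail'])
print(counts)
```

A groupby version such as df.groupby(['Names', 'Detail']).size().unstack(fill_value=0) should give the same table.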
I would like to transpose a pandas DataFrame from rows to columns. The number of rows is dynamic, so the transposed DataFrame must also have a dynamic number of columns.
I succeeded using iterrows() and concat() methods, but I would like to optimize my code.
Please find my current code:
import pandas as pd
expected_results_transposed = pd.DataFrame()
for i, r in expected_results.iterrows():
    t = pd.Series([r.get('B')], name=r.get('A'))
    expected_results_transposed = pd.concat([expected_results_transposed, t], axis=1)
print("CURRENT CASE EXPECTED RESULTS TRANSPOSED:\n{0}\n".format(expected_results_transposed))
Please find an illustration of expected result :
picture of expected result
Do you have a solution that optimizes my code using standard pandas DataFrame methods or options?
Thank you for your help :)
Use DataFrame.transpose + DataFrame.set_index:
new_df=df.set_index('A').T.reset_index(drop=True)
new_df.columns.name=None
Example
df2=pd.DataFrame({'A':'Mike Ana Jon Jenny'.split(),'B':[1,2,3,4]})
print(df2)
A B
0 Mike 1
1 Ana 2
2 Jon 3
3 Jenny 4
new_df=df2.set_index('A').T.reset_index(drop=True)
new_df.columns.name=None
print(new_df)
Mike Ana Jon Jenny
0 1 2 3 4
I would like to convert one column of data to multiple columns in dataframe based on certain values/conditions.
Please find the code to generate the input dataframe
df1 = pd.DataFrame({'VARIABLE':['studyid',1,'age_interview', 65,'Gender','1.Male',
'2.Female',
'Ethnicity','1.Chinese','2.Indian','3.Malay']})
The data looks like as shown below
Please note that I may not know the column names in advance, but the data usually follows this format. What I have shown above is a sample; the real data might have around 600-700 columns arranged in this fashion.
What I would like to do is turn the values that start with a non-digit character into new columns of a dataframe. It can be a new dataframe.
I attempted to write a for loop but it failed with an error. Can you please help me achieve this outcome?
for i in range(3,len(df1)):
    #str(df1['VARIABLE'][i].contains('^\d'))
    if (df1['VARIABLE'][i].astype(str).contains('^\d') == True):
With the above loop I was trying to check whether the first character is a digit. If it is, the entry is kept as a value (e.g. 1, 2, 3); if it is a letter (e.g. Gender, Ethnicity), a new column is created. But I guess this is an incorrect and lengthy approach.
For example, in the above example, the columns would be studyid,age_interview,Gender,Ethnicity.
The final output would look like this
Can you please let me know if there is an elegant approach to do this?
You can use groupby to do something like:
m=~df1['VARIABLE'].str[0].str.isdigit().fillna(True)
new_df=(pd.DataFrame(df1.groupby(m.cumsum()).VARIABLE.apply(list).
values.tolist()).set_index(0).T)
print(new_df.rename_axis(None,axis=1))
  studyid age_interview    Gender  Ethnicity
1       1            65    1.Male  1.Chinese
2    None          None  2.Female   2.Indian
3    None          None      None    3.Malay
Explanation: m is a helper series that marks where each group starts (True for a column name, False for a value), so its cumulative sum labels the groups:
print(m.cumsum())
0 1
1 1
2 2
3 2
4 3
5 3
6 3
7 4
8 4
9 4
10 4
Then we group by this helper series and apply list:
df1.groupby(m.cumsum()).VARIABLE.apply(list)
VARIABLE
1 [studyid, 1]
2 [age_interview, 65]
3 [Gender, 1.Male, 2.Female]
4 [Ethnicity, 1.Chinese, 2.Indian, 3.Malay]
Name: VARIABLE, dtype: object
At this point we have each group as a list with the column name as the first entry.
So we create a dataframe with this and set the first column as index and transpose to get our desired output.
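The steps above can also be sketched end to end as follows. This variant is an assumption on my part rather than the code from the answer: it casts the column to str first, because the .str accessor returns NaN for the integer entries (1 and 65), and it builds the result from a dict of Series. The names values, is_header, group_id and columns are illustrative.

```python
import pandas as pd

df1 = pd.DataFrame({'VARIABLE': ['studyid', 1, 'age_interview', 65, 'Gender', '1.Male',
                                 '2.Female',
                                 'Ethnicity', '1.Chinese', '2.Indian', '3.Malay']})

# Cast to str so integer entries are handled like any other value.
values = df1['VARIABLE'].astype(str)

# True where an entry starts with a non-digit, i.e. where a new column name begins.
is_header = ~values.str[0].str.isdigit()

# The running total of the header flags labels each group.
group_id = is_header.cumsum()

# One list per group: the first item is the column name, the rest are its values.
groups = values.groupby(group_id).apply(list)

# Series of different lengths are aligned on their index and padded with NaN.
columns = {g[0]: pd.Series(g[1:]) for g in groups}
new_df = pd.DataFrame(columns)
print(new_df)
```

Because of the str cast, the values come back as strings ('1', '65'); convert them afterwards if numeric types are needed.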
Use itertools.groupby and then construct pd.DataFrame:
import pandas as pd
import itertools
l = ['studyid',1,'age_interview', 65,'Gender','1.Male',
'2.Female',
'Ethnicity','1.Chinese','2.Indian','3.Malay']
l = list(map(str, l))
grouped = [list(g) for k, g in itertools.groupby(l, key=lambda x:x[0].isnumeric())]
d = {k[0]: v for k,v in zip(grouped[::2],grouped[1::2])}
pd.DataFrame.from_dict(d, orient='index').T
Output:
     Gender studyid age_interview  Ethnicity
0    1.Male       1            65  1.Chinese
1  2.Female    None          None   2.Indian
2      None    None          None    3.Malay
Let's say I have a dataframe:
A      B   C               D
john   I   agree           Average
ryan   II  agree           agree
rose   V   strongly agree  disagree
Shawn  VI  disagree        agree
What I want to do is replace the values in columns C and D with numbers, like this:
A      B   C  D
john   I   1  3
ryan   II  1  1
rose   V   2  0
Shawn  VI  0  1
I can use map for a single column, but with more than one column, how do I change the values to numbers without writing a separate call for every column? (I know I could use a for loop, but I don't see how to apply it here.)
Does anyone know how to do this?
I tried to use a for loop:
def assignNumbers(df):
    for i in df:
        second= df[i].map({'Average':3, 'Agree':1, 'Disagree':0, 'Strongly Agree':2})
    return second
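This can be done with scikit-learn's LabelEncoder. Note that it numbers the labels in sorted order, so the codes may differ from a hand-picked mapping.
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
df['C'] = le.fit_transform(df['C'])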
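If the exact numbers from the question matter, a plain pandas sketch is to apply one mapping to several columns at once. This assumes the labels should be matched case-insensitively (the sample mixes 'Average' and 'agree'), so the keys of the illustrative mapping dict are lower-cased; cols is likewise just an illustrative name.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    'A': ['john', 'ryan', 'rose', 'Shawn'],
    'B': ['I', 'II', 'V', 'VI'],
    'C': ['agree', 'agree', 'strongly agree', 'disagree'],
    'D': ['Average', 'agree', 'disagree', 'agree'],
})

# Lower-cased keys so that 'Average' and 'average' map to the same number.
mapping = {'average': 3, 'agree': 1, 'disagree': 0, 'strongly agree': 2}

# apply runs the lambda once per selected column (each column is a Series).
cols = ['C', 'D']
df[cols] = df[cols].apply(lambda s: s.str.lower().map(mapping))
print(df)
```

Any label missing from the mapping becomes NaN, which makes unexpected values easy to spot.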
I'll try to use a simple example to describe my problem.
I have a csv file with many columns. One of these columns has the header "names".
In this "names" column I need to count how many times the name "John" appears.
As an example, my column "names" is as follows:
names
John
John M
Mike John
Audrey
Andrew
For this case I would need a python script using pandas to get the value of 3 because the word 'John' is repeated three times.
This is the code I am using:
from_csv = pd.read_csv(r'csv.csv', usecols = ['names'] , index_col=0, header=0 )
times = from_csv.query('names == "John"').names.count()
But it returns only 1, because only one row contains exactly "John".
I have tried using:
times = from_csv.query('names == "*John*"').names.count()
but no success.
How can I get the 3 for this particular situation? thanks
Use str.contains:
df['names'].str.contains('John').sum()
Out[246]: 3
Or use map with the in operator:
sum(map(lambda x: 'John' in x, df['names']))
Out[248]: 3
You can use pandas.Series.str.count to count the number of times in each row a pattern is encountered.
df.names.str.count('John').sum()
3
In this example, it matches OP's output. However, this would produce different results if John appeared more than once in one row. Suppose we had this df instead:
df
names
0 John John
1 John M John M
2 Mike John Mike John
3 Audrey Audrey
4 Andrew Andrew
Then my answer produces
df.names.str.count('John').sum()
6
While Wen's answer produces
df.names.str.contains('John').sum()
3
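The difference between the two approaches can be reproduced with a small self-contained sketch. The dataframe below is rebuilt from the second example above, and the variable names are illustrative. Both methods treat the pattern as a regular expression by default.

```python
import pandas as pd

# Each row repeats its name, as in the second example above.
df = pd.DataFrame({'names': ['John John', 'John M John M', 'Mike John Mike John',
                             'Audrey Audrey', 'Andrew Andrew']})

# str.contains yields one boolean per row, so the sum is the number of matching rows.
rows_with_john = df['names'].str.contains('John').sum()

# str.count yields the number of matches inside each row, so the sum is the total number of occurrences.
total_occurrences = df['names'].str.count('John').sum()

print(rows_with_john, total_occurrences)
```

Which one is right depends on whether rows or occurrences should be counted.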
I have a dataframe where column 'code' is populated with codes and column 'note' is populated with notes. Since the codes mean something, I want to count their frequencies, e.g. with .value_counts(), and I also want to know one note that is attached to each unique code.
For example, the code A has the note 'adam' in one of its rows. I want to count how many A's there are and display any one of the notes attached to an A. (I don't want to count each code separately, but to show the frequency for all codes at once.)
Example:
IN:
code note
A adam
A august
A abdul
B bree
B bar
A august
B barnie
B barnie
C ceasar
C coolio
A august
OUT:
A 5 adam
B 4 bree
C 2 ceasar
Use agg with two aggfuncs - count, and first:
df.groupby('code').note.agg(['count', 'first'])
      count   first
code
A         5    adam
B         4    bree
C         2  ceasar
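For a runnable version, the sketch below rebuilds the sample data from the question and uses named aggregation to pick the output column names. The names summary and example_note are illustrative choices, not part of the original answer.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    'code': list('AAABBABBCCA'),
    'note': ['adam', 'august', 'abdul', 'bree', 'bar', 'august',
             'barnie', 'barnie', 'ceasar', 'coolio', 'august'],
})

# For each code: how many notes it has, and the first note seen for it.
summary = df.groupby('code')['note'].agg(count='count', example_note='first')
print(summary)
```

Note that 'count' ignores missing notes; use 'size' instead if rows with a missing note should be counted too.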