DataFrame-specific transposition optimisation - Python

I would like to transpose a pandas DataFrame from rows to columns, where the number of rows is dynamic, so the transposed DataFrame must also have a dynamic number of columns.
I succeeded using the iterrows() and concat() methods, but I would like to optimize my code.
Please find my current code:
import pandas as pd

expected_results_transposed = pd.DataFrame()
for i, r in expected_results.iterrows():
    t = pd.Series([r.get('B')], name=r.get('A'))
    expected_results_transposed = pd.concat([expected_results_transposed, t], axis=1)
print("CURRENT CASE EXPECTED RESULTS TRANSPOSED:\n{0}\n".format(expected_results_transposed))
Please find an illustration of the expected result:
(picture of expected result omitted)
Do you have any solution to optimize my code using standard pandas DataFrame methods/options?
Thank you for your help :)

Use DataFrame.set_index + DataFrame.T:
new_df = df.set_index('A').T.reset_index(drop=True)
new_df.columns.name = None
Example:
df2 = pd.DataFrame({'A': 'Mike Ana Jon Jenny'.split(), 'B': [1, 2, 3, 4]})
print(df2)
       A  B
0   Mike  1
1    Ana  2
2    Jon  3
3  Jenny  4
new_df = df2.set_index('A').T.reset_index(drop=True)
new_df.columns.name = None
print(new_df)
   Mike  Ana  Jon  Jenny
0     1    2    3      4
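Applied to the question's code, the whole iterrows()/concat() loop collapses to a single chain. A minimal sketch, assuming a made-up expected_results frame with the 'A' and 'B' columns the loop reads:

```python
import pandas as pd

# Hypothetical stand-in for the question's expected_results frame,
# assuming the 'A' (names) and 'B' (values) columns used by the loop.
expected_results = pd.DataFrame({'A': ['x', 'y', 'z'], 'B': [10, 20, 30]})

# One vectorised chain replaces the whole iterrows()/concat() loop:
# 'A' becomes the index, transposing turns it into the columns.
expected_results_transposed = (expected_results.set_index('A').T
                               .reset_index(drop=True))
expected_results_transposed.columns.name = None
print(expected_results_transposed)  # one row, columns x, y, z
```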

Related

Count if with 2 conditions - Python

I'm having some trouble solving this, so I come here for your help.
I have a dataframe with many columns, and I want to count how many cells of a specific column meet a condition on another column. In Excel this would be COUNTIFS, but I can't figure out the equivalent for my problem. Let me give you an example.
Names Detail
John B
John B
John S
Martin S
Martin B
Robert S
In this df for example there are 3 "B" and 3 "S" in total.
How can I get how many "B" and "S" there are for each name in column A?
I'm trying to get a result DataFrame like:
B S
John 2 1
Martin 1 1
Robert 0 1
I tried
b_var = sum(1 for i in df['Names'] if i == 'John')
s_var = sum(1 for k in df['Detail'] if k == 'B')
and then a for loop? But I don't know how to apply both conditions at once. Or is a groupby approach better?
Thanks!!
df.pivot_table(index='Names', columns='Detail', aggfunc=len)
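pd.crosstab is a convenient alternative that also fills never-occurring name/detail pairs with the 0 shown in the desired output (pivot_table would need fill_value=0 for that). A sketch with the question's data:

```python
import pandas as pd

df = pd.DataFrame({'Names': ['John', 'John', 'John', 'Martin', 'Martin', 'Robert'],
                   'Detail': ['B', 'B', 'S', 'S', 'B', 'S']})

# crosstab counts co-occurrences and fills absent pairs with 0 by default
counts = pd.crosstab(df['Names'], df['Detail'])
print(counts)
# Detail  B  S
# Names
# John    2  1
# Martin  1  1
# Robert  0  1
```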

How to concatenate partially sequential occurring rows in data frame using pandas

I have a CSV in which conversations are broken into multiple rows, as follows:
Names,text,conv_id
tim,hi,1234
jon,hello,1234
jon,how,1234
jon,are you,1234
tim,hey,1234
tim,i am good,1234
pam, me too,1234
jon,great,1234
jon,hows life,1234
So I want to concatenate the sequentially occurring rows into one row,
as follows, to make it more meaningful.
Expected output:
Names,text,conv_id
tim,hi,1234
jon,hello how are you,1234
tim,hey i am good,1234
pam, me too,1234
jon,great hows life,1234
I tried a couple of things but couldn't get it to work. Can anyone please guide me on how to do this?
Thanks in advance.
You can use Series.shift + Series.cumsum to build the group labels for groupby, then join the text of each group with groupby.apply. 'conv_id' and 'Names' are added to the group keys so that they can be recovered afterwards with Series.reset_index. Finally, DataFrame.reindex puts the columns back in their initial order:
groups = df['Names'].rename('groups').ne(df['Names'].shift()).cumsum()
new_df = (df.groupby([groups, 'conv_id', 'Names'])['text']
            .apply(' '.join)
            .reset_index(level=['Names', 'conv_id'])
            .reindex(columns=df.columns))
print(new_df)
  Names               text  conv_id
1   tim                 hi     1234
2   jon  hello how are you     1234
3   tim      hey i am good     1234
4   pam             me too     1234
5   jon    great hows life     1234
Detail:
print(groups)
0 1
1 2
2 2
3 2
4 3
5 3
6 4
7 5
8 5
dtype: int64

Combine two datasets of equal length in Python

I've got two datasets of equal length, each with only one column. I'm trying to combine them into one dataset with two columns.
What I've tried gives me one column with all the values from the first DataFrame, but the second column is all NaNs.
Please help.
I have tried .join, .merge, pd.concat, .add, ...
df_low_rename = df_low_sui.rename(index=str, columns={'suicides/100k pop': 'low_gdp'})
df_high_rename = df_high_sui.rename(index=str, columns={'suicides/100k pop': 'high_gdp'})
df_combined = df_low_rename.add(df_high_rename)
df_combined
Output: (screenshot omitted)
The pandas merge function works.
Dataset 1:
import pandas as pd
data = [['Alex',10],['Bob',12],['Clarke',13]]
df1 = pd.DataFrame(data,columns=['Name','Age'])
print(df1)
output:
Name Age
0 Alex 10
1 Bob 12
2 Clarke 13
Dataset 2:
data2 = [['Alex','Science'],['Bob','Physics'],['Clarke','Social']]
df2 = pd.DataFrame(data2,columns=['Name','Courses'])
print(df2)
output:
Name Courses
0 Alex Science
1 Bob Physics
2 Clarke Social
Merging the datasets:
final = pd.merge(df1, df2)
print(final)
output:
Name Age Courses
0 Alex 10 Science
1 Bob 12 Physics
2 Clarke 13 Social
I believe a join will do it for you. Something like this:
df_low_rename.join(df_high_rename)
Try with concat on the column axis:
combined = pd.concat([df_low_rename, df_high_rename], axis=1)
It turned out the two datasets didn't have the same indexes. I fixed it like this:
df_low_rename = df_low_rename.reset_index(drop=True)
df_high_rename = df_high_rename.reset_index(drop=True)
Then I used the join function:
df_combined = df_low_rename.join(df_high_rename)
df_combined
This way, I got the right output. Thanks to everyone who tried to help me and I apologize for this rookie mistake.
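For reference, the reason .add produced NaNs is that it performs element-wise arithmetic aligned on index and columns, not concatenation. A minimal sketch of the mismatch and the reset_index/concat fix, using made-up column names and index labels:

```python
import pandas as pd

# Two one-column frames with disjoint indexes, mimicking the situation
low = pd.DataFrame({'low_gdp': [1.0, 2.0]}, index=['a', 'b'])
high = pd.DataFrame({'high_gdp': [3.0, 4.0]}, index=['c', 'd'])

# .add aligns on index AND columns; nothing matches, so everything is NaN
print(low.add(high))

# Resetting both indexes first lets concat pair the rows positionally
combined = pd.concat([low.reset_index(drop=True),
                      high.reset_index(drop=True)], axis=1)
print(combined)
#    low_gdp  high_gdp
# 0      1.0       3.0
# 1      2.0       4.0
```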

How to compress or stack a pandas dataframe along the rows?

I have a large pandas DataFrame with several columns; however, let's focus on two of them:
df = pd.DataFrame([['hey how are you', 'fine thanks', 1],
                   ['good to know', 'yes, and you', 2],
                   ['I am fine', 'ok', 3],
                   ['see you', 'bye!', 4]], columns=list('ABC'))
df
Out:
A B C
0 hey how are you fine thanks 1
1 good to know yes, and you 2
2 I am fine ok 3
3 see you bye! 4
From the previous DataFrame, how can I compress two specific columns into a single column while carrying over the values of the other columns? For example:
A C
0 hey how are you 1
1 fine thanks 1
2 good to know 2
3 yes, and you 2
4 I am fine 3
5 ok 3
6 see you 4
7 bye! 4
I tried:
df = df['A'].stack()
df = df.groupby(level=0)
df
However, it doesn't work. Any idea how to achieve the new format?
This will drop the column names, but gets the job done:
import pandas as pd

df = pd.DataFrame([['hey how are you', 'fine thanks'],
                   ['good to know', 'yes, and you'],
                   ['I am fine', 'ok'],
                   ['see you', 'bye!']], columns=list('AB'))
df.stack().reset_index(drop=True)
0 hey how are you
1 fine thanks
2 good to know
3 yes, and you
4 I am fine
5 ok
6 see you
7 bye!
dtype: object
The default stack behaviour keeps the column names:
df.stack()
0 A hey how are you
B fine thanks
1 A good to know
B yes, and you
2 A I am fine
B ok
3 A see you
B bye!
dtype: object
You can select the columns to stack if you have more of them; just use column indexing:
df[["A", "B"]].stack()
With additional columns things get tricky: you need to align the indices by dropping the level that contains the column names:
df["C"] = range(4)
stacked = df[["A", "B"]].stack()
stacked.index = stacked.index.droplevel(level=1)
stacked
0 hey how are you
0 fine thanks
1 good to know
1 yes, and you
2 I am fine
2 ok
3 see you
3 bye!
dtype: object
Now we can concat with C column:
pd.concat([stacked, df["C"]], axis=1)
0 C
0 hey how are you 0
0 fine thanks 0
1 good to know 1
1 yes, and you 1
2 I am fine 2
2 ok 2
3 see you 3
3 bye! 3
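If a real column name is wanted instead of the 0 that concat produces, naming the stacked Series first helps. A small variation on the steps above (the final reset_index is an addition, to match the row numbering the question asked for):

```python
import pandas as pd

df = pd.DataFrame([['hey how are you', 'fine thanks'],
                   ['good to know', 'yes, and you'],
                   ['I am fine', 'ok'],
                   ['see you', 'bye!']], columns=list('AB'))
df['C'] = range(4)

# Stack A and B, then drop the level holding the column labels
stacked = df[['A', 'B']].stack()
stacked.index = stacked.index.droplevel(level=1)

# Naming the Series before concat gives the column a label instead of 0;
# concat broadcasts C across the duplicated index values
out = pd.concat([stacked.rename('A'), df['C']], axis=1).reset_index(drop=True)
print(out)
```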
You can flatten() (or reshape(-1, )) the values of the DataFrame, which are stored as a numpy array:
pd.DataFrame(df.values.flatten(), columns=['A'])
A
0 hey how are you
1 fine thanks
2 good to know
3 yes, and you
4 I am fine
5 ok
6 see you
7 bye!
Comments: the default behaviour of np.ndarray.flatten and np.ndarray.reshape is what you want here: the column index varies faster than the row index of the original array, i.e. row-major (C-style) order. To make the row index vary faster than the column index instead, pass order='F' for column-major (Fortran-style) ordering. Docs: https://docs.scipy.org/doc/numpy-1.15.0/reference/generated/numpy.ndarray.flatten.html
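For instance, on a small two-column frame the two orders differ as follows:

```python
import pandas as pd

df = pd.DataFrame([['a1', 'b1'], ['a2', 'b2']], columns=list('AB'))

# Row-major (default, order='C'): the column index varies fastest
print(df.values.flatten().tolist())           # ['a1', 'b1', 'a2', 'b2']

# Column-major (order='F'): the row index varies fastest
print(df.values.flatten(order='F').tolist())  # ['a1', 'a2', 'b1', 'b2']
```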
What you may be looking for is pandas.concat.
It accepts "a sequence or mapping of Series, DataFrame, or Panel objects", so you can pass a list of columns selected from your DataFrame (each of which is a pd.Series when a single column is indexed).
df3 = pd.concat([df['A'], df['B']])
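Note that this stacks all of A followed by all of B rather than interleaving the rows as in the desired output. If the interleaved order matters, a stable sort on the duplicated index restores it; a sketch using the question's data:

```python
import pandas as pd

df = pd.DataFrame([['hey how are you', 'fine thanks'],
                   ['good to know', 'yes, and you'],
                   ['I am fine', 'ok'],
                   ['see you', 'bye!']], columns=list('AB'))

# Plain concat gives A0..A3 followed by B0..B3, with index 0,1,2,3,0,1,2,3
df3 = pd.concat([df['A'], df['B']])

# A stable sort on that duplicated index pairs each A row with its B row
interleaved = df3.sort_index(kind='mergesort').reset_index(drop=True)
print(interleaved.tolist())
```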

Transforming Dataframe Columns in Python

If I have a pandas DataFrame like such (picture omitted)
and I want to transform it in a way that results in (picture omitted),
is there a good pattern to achieve this?
Use a pivot table:
pd.pivot_table(df, index='name', columns=['property'], aggfunc=sum).fillna(0)
Output:
          price
property   boat  dog  house
name
Bob           0    5      4
Josh          0    2      0
Sam           3    0      0
Sidenote: pasting your df as text helps, so people can use pd.read_clipboard instead of generating the df themselves.
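Since the question's pictures are missing, here is a hypothetical reconstruction of the input, inferred from the answer's output, showing the pivot in full:

```python
import pandas as pd

# Made-up input inferred from the answer's output table
df = pd.DataFrame({'name': ['Bob', 'Bob', 'Sam', 'Josh'],
                   'property': ['dog', 'house', 'boat', 'dog'],
                   'price': [5, 4, 3, 2]})

# name becomes the index, each property becomes a column of summed prices
result = pd.pivot_table(df, index='name', columns=['property'],
                        aggfunc='sum').fillna(0)
print(result)
```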
