How to concatenate sequentially occurring rows in a DataFrame using pandas - python

I have a CSV in which the text is broken across multiple rows, as follows:
Names,text,conv_id
tim,hi,1234
jon,hello,1234
jon,how,1234
jon,are you,1234
tim,hey,1234
tim,i am good,1234
pam, me too,1234
jon,great,1234
jon,hows life,1234
So I want to concatenate the sequentially occurring rows into one row,
as follows, to make it more meaningful.
Expected output:
Names,text,conv_id
tim,hi,1234
jon,hello how are you,1234
tim,hey i am good,1234
pam, me too,1234
jon,great hows life,1234
I tried a couple of things but couldn't get it to work. Can anyone please guide me on how to do this?
Thanks in advance.

You can use Series.shift + Series.cumsum
to create the appropriate groups for groupby, and then join the text of each group with groupby.apply. 'conv_id' and 'Names' are added to the grouping keys so that they can be recovered afterwards with Series.reset_index. Finally, DataFrame.reindex is used to place the columns in the initial order.
# a new group starts whenever 'Names' differs from the previous row
groups = df['Names'].rename('groups').ne(df['Names'].shift()).cumsum()

new_df = (df.groupby([groups, 'conv_id', 'Names'])['text']
            .apply(lambda x: ','.join(x))  # use ' '.join for space-separated output
            .reset_index(level=['Names', 'conv_id'])
            .reindex(columns=df.columns))
print(new_df)
Names text conv_id
1 tim hi 1234
2 jon hello,how,are you 1234
3 tim hey,i am good 1234
4 pam me too 1234
5 jon great,hows life 1234
Detail:
print(groups)
0 1
1 2
2 2
3 2
4 3
5 3
6 4
7 5
8 5
dtype: int64
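For a fully reproducible run that also matches the question's space-joined expected output, here is a minimal sketch assuming the sample data above (the DataFrame is rebuilt inline):

import pandas as pd

# sample data from the question
df = pd.DataFrame({
    'Names':   ['tim', 'jon', 'jon', 'jon', 'tim', 'tim', 'pam', 'jon', 'jon'],
    'text':    ['hi', 'hello', 'how', 'are you', 'hey', 'i am good', 'me too', 'great', 'hows life'],
    'conv_id': [1234] * 9,
})

# same grouping idea as above; join with a space instead of a comma
groups = df['Names'].rename('groups').ne(df['Names'].shift()).cumsum()
out = (df.groupby([groups, 'conv_id', 'Names'])['text']
         .apply(' '.join)
         .reset_index(level=['conv_id', 'Names'])
         .reindex(columns=df.columns))
print(out)
#        Names               text  conv_id
# groups
# 1        tim                 hi     1234
# 2        jon  hello how are you     1234
# 3        tim      hey i am good     1234
# 4        pam             me too     1234
# 5        jon    great hows life     1234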

Related

Pandas groupby over consecutive duplicates

Given a table,
Id  Value
1   1
2   2
2   3
3   4
4   5
4   6
2   8
2   3
1   1
Instead of a simple groupby('Id').agg({'Value':'sum'}) which would perform aggregation over all the instances and yield a table with only four rows, I wish the result to aggregate only over the nearby instances and hence maintaining the order the table was created.
The expected output is the following:
Id  Value
1   1
2   5
3   4
4   11
2   11
1   1
If not possible with pandas groupby, any other kind of trick would also be greatly appreciated.
Note: If the above example is not helpful, basically what I want is to somehow compress the table with aggregating over 'Value'. The aggregation should be done only over the duplicate 'Id's which occur one exactly after the other.
Unfortunately, the answers from eshirvana and wwnde don't seem to work for a long dataset. Inspired by wwnde's answer, I found a workaround:
# create a series referring to each group of consecutive identical Ids
new = []
seen = None  # nothing seen yet
i = -1
for item in df.Id:
    if item != seen:
        i += 1
        seen = item
    new.append(i)
df['temp'] = new
Now, we group by the 'temp' column.
df.groupby('temp').agg({'Id':max, 'Value':sum}).reset_index(drop=True)
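A vectorized sketch of the same grouping (my own variant, reusing the shift/cumsum idiom from the first answer above), assuming an Id/Value frame as in the question:

import pandas as pd

df = pd.DataFrame({'Id':    [1, 2, 2, 3, 4, 4, 2, 2, 1],
                   'Value': [1, 2, 3, 4, 5, 6, 8, 3, 1]})

# each change in Id starts a new group; cumsum turns the change flags into group ids
temp = df['Id'].ne(df['Id'].shift()).cumsum()
out = df.groupby(temp).agg({'Id': 'max', 'Value': 'sum'}).reset_index(drop=True)
print(out)
#    Id  Value
# 0   1      1
# 1   2      5
# 2   3      4
# 3   4     11
# 4   2     11
# 5   1      1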

how to concatenate two cells in a pandas column based on some conditions?

Hello I have this pandas dataframe:
Key Predictions
C10D1 1
C11D1 8
C11D2 2
C12D1 2
C12D2 8
C13D1 3
C13D2 9
C14D1 4
C14D2 9
C15D1 8
C15D2 3
C1D1 5
C2D1 7
C3D1 4
C4D1 1
C4D2 9
C5D1 3
C5D2 2
C6D1 1
C6D2 0
C7D1 8
C7D2 6
C8D1 3
C8D2 3
C9D1 5
C9D2 1
I want to concatenate the cells from the "Predictions" column wherever the "Key" values share the same leading characters (the "C" plus digits before the "D").
For example, in the "Key" column I have "C11D1" and "C11D2"; as they both contain "C11", I would like to concatenate the rows from the Predictions column that have "C11D1" and "C11D2" as index.
Thus the result should be:
Predictions
Key
C10 1
C11 82
C12 28
and so on
EDIT: Since the OP wants to concatenate values that share the same index, adding that solution here.
df.groupby(df['Key'].replace(regex=True,to_replace=r'(C[0-9]+).*',value=r'\1'))\
['Predictions'].apply(lambda x: ','.join(map(str,x)))
The above joins them with a comma; you could change it to an empty string or a space as needed in the ','.join part.
Could you please try the following.
df.groupby(df['Key'].replace(regex=True,to_replace=r'(C[0-9]+).*',value=r'\1')).sum()
OR with resetting index try:
df.groupby(df['Key'].replace(regex=True,to_replace=r'(C[0-9]+).*',value=r'\1')).sum()\
.reset_index()
Explanation: Adding an explanation for the above code.
df.groupby(df['Key'].replace(regex=True,to_replace=r'(C[0-9]+).*',value=r'\1')).sum()
df.groupby: use groupby on df with whatever keys are passed to it.
df['Key'].replace(regex=True,to_replace=r'(C[0-9]+).*',value=r'\1'): on df's Key column, use a regex to strip everything after the C + digits prefix, as per the OP's question.
.sum(): get the total of the values within each resulting group.
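An alternative sketch (not from the answer above): extract the "C" + digits prefix with str.extract and join the predictions without a separator, so C11's 8 and 2 become "82" as in the expected output:

key = df['Key'].str.extract(r'^(C\d+)', expand=False)   # 'C11D2' -> 'C11'
out = df.groupby(key)['Predictions'].apply(lambda x: ''.join(map(str, x)))
# e.g. out['C11'] == '82', out['C12'] == '28'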

Calculating mean value of item in several columns in pandas

I have a dataframe with values spread over several columns. I want to calculate the mean value of all items from specific columns.
All the solutions I looked up end up giving me either the separate means of each column or the mean of the means of the selected columns.
E.g. my Dataframe looks like this:
Name a b c d
Alice 1 2 3 4
Alice 2 4 2
Alice 3 2
Alice 1 5 2
Ben 3 3 1 3
Ben 4 1 2 3
Ben 1 2 2
And I want to see the mean of the values in columns b & c for each "Alice":
When I try:
df[df["Name"]=="Alice"][["b","c"]].mean()
The result is:
b 2.00
c 4.00
dtype: float64
In another post I found a suggestion to try a "double" mean, one for each axis, e.g.:
df[df["Name"]=="Alice"][["b","c"]].mean(axis=1).mean()
But the result was then:
3.00
which is the mean of the means of both columns.
I am expecting a way to calculate:
(2 + 3 + 4 + 5) / 4 = 3.50
Is there a way to do this in Python?
You can use numpy's np.nanmean here: this simply sees your section of the dataframe as an array and calculates the mean over the entire section by default:
>>> np.nanmean(df.loc[df['Name'] == 'Alice', ['b', 'c']])
3.5
Or if you want to group by name, you can first stack the dataframe, like:
>>> df[['Name','b','c']].set_index('Name').stack().reset_index().groupby('Name').agg('mean')
0
Name
Alice 3.500000
Ben 1.833333
You can groupby to sum all values and get their respective counts, then divide to get the mean.
This way you get it for all Names at once.
g = df.groupby('Name')[['b', 'c']]
g.sum().sum(1)/g.count().sum(1)
Name
Alice 3.500000
Ben 1.833333
dtype: float64
PS: In your example, it looks like you have empty strings in some cells. That's not advisable, since your columns' dtypes will be set to object. Try to have NaNs instead, to take full advantage of vectorized operations.
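A minimal sketch of that cleanup (my own addition): convert the empty strings to NaN and make b and c numeric:

import numpy as np
df[['b', 'c']] = df[['b', 'c']].replace('', np.nan).apply(pd.to_numeric)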
Assume all your columns are numeric and the empty cells are NaN. Then a simple set_index, stack, and direct mean works:
df.set_index('Name')[['b','c']].stack().mean(level=0)
Out[117]:
Name
Alice 3.500000
Ben 1.833333
dtype: float64
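Note that the level argument of Series.mean was deprecated in newer pandas and removed in 2.0; the equivalent groupby form would be, as a sketch:

df.set_index('Name')[['b', 'c']].stack().groupby(level=0).mean()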

How to compress or stack a pandas dataframe along the rows?

I have a large pandas dataframe with several columns; however, let's focus on two:
df = pd.DataFrame([['hey how are you', 'fine thanks',1],
['good to know', 'yes, and you',2],
['I am fine','ok',3],
['see you','bye!',4]],columns=list('ABC'))
df
Out:
A B C
0 hey how are you fine thanks 1
1 good to know yes, and you 2
2 I am fine ok 3
3 see you bye! 4
From the previous data frame, how can I compress two specific columns into a single pandas dataframe, carrying over the values of the other columns? For example:
A C
0 hey how are you 1
1 fine thanks 1
2 good to know 2
3 yes, and you 2
4 I am fine 3
5 ok 3
6 see you 4
7 bye! 4
I tried to:
df = df['A'].stack()
df = df.groupby(level=0)
df
However, it doesn't work. Any idea how to achieve the new format?
This will drop the column names, but gets the job done:
import pandas as pd
df = pd.DataFrame([['hey how are you', 'fine thanks'],
['good to know', 'yes, and you'],
['I am fine','ok'],
['see you','bye!']],columns=list('AB'))
df.stack().reset_index(drop=True)
0 hey how are you
1 fine thanks
2 good to know
3 yes, and you
4 I am fine
5 ok
6 see you
7 bye!
dtype: object
The default stack behaviour keeps the column names:
df.stack()
0 A hey how are you
B fine thanks
1 A good to know
B yes, and you
2 A I am fine
B ok
3 A see you
B bye!
dtype: object
You can select the columns to stack if you have more of them; just use column indexing:
df[["A", "B"]].stack()
With additional columns, things get tricky: you need to align indices by dropping one level (the one containing the column names):
df["C"] = range(4)
stacked = df[["A", "B"]].stack()
stacked.index = stacked.index.droplevel(level=1)
stacked
0 hey how are you
0 fine thanks
1 good to know
1 yes, and you
2 I am fine
2 ok
3 see you
3 bye!
dtype: object
Now we can concat with the C column:
pd.concat([stacked, df["C"]], axis=1)
0 C
0 hey how are you 0
0 fine thanks 0
1 good to know 1
1 yes, and you 1
2 I am fine 2
2 ok 2
3 see you 3
3 bye! 3
You can flatten() (or reshape(-1, )) the values of the DataFrame, which are stored as a numpy array:
pd.DataFrame(df.values.flatten(), columns=['A'])
A
0 hey how are you
1 fine thanks
2 good to know
3 yes, and you
4 I am fine
5 ok
6 see you
7 bye!
Comments: The default behaviour of np.ndarray.flatten and np.ndarray.reshape is what you want, which is to vary the column index faster than the row index in the original array. This is so-called row-major (C-style) order. To vary the row index faster than the column index, pass in order='F' for column-major, Fortran-style ordering. Docs: https://docs.scipy.org/doc/numpy-1.15.0/reference/generated/numpy.ndarray.flatten.html
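A small illustration of the difference (my own sketch, using the two text columns from the question):

import pandas as pd

df = pd.DataFrame([['hey how are you', 'fine thanks'],
                   ['good to know', 'yes, and you']], columns=list('AB'))

df.values.flatten()            # row-major: 'hey how are you', 'fine thanks', 'good to know', 'yes, and you'
df.values.flatten(order='F')   # column-major: 'hey how are you', 'good to know', 'fine thanks', 'yes, and you'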
What you may be looking for is pandas.concat.
It accepts "a sequence or mapping of Series, DataFrame, or Panel objects", so you can pass a list of your DataFrame objects selecting the columns (which will be pd.Series when indexed for a single column).
df3 = pd.concat([df['A'], df['B']])
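Note that this stacks all of A before all of B. A minimal sketch (my own addition) to restore the interleaved row order and carry the C column along, assuming the df from the question:

stacked = pd.concat([df['A'], df['B']]).sort_index(kind='mergesort')  # stable sort keeps A before B per row
out = pd.DataFrame({'A': stacked.values, 'C': df.loc[stacked.index, 'C'].values})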

Transforming Dataframe Columns in Python

If I have a pandas DataFrame with name, property, and price columns
and I want to transform it into a wide table with one column per property,
is there a correct way to achieve this? A good pattern?
Use a pivot table:
pd.pivot_table(df,index='name',columns=['property'],aggfunc=sum).fillna(0)
Output:
price
property boat dog house
name
Bob 0 5 4
Josh 0 2 0
Sam 3 0 0
Sidenote: Pasting in your df's helps so people can use pd.read_clipboard instead of generating the df themselves.
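For reference, a self-contained sketch with made-up input rows (reconstructed from the pivot output above, so the exact values are assumptions):

import pandas as pd

df = pd.DataFrame({'name':     ['Bob', 'Bob', 'Josh', 'Sam'],
                   'property': ['dog', 'house', 'dog', 'boat'],
                   'price':    [5, 4, 2, 3]})

pd.pivot_table(df, index='name', columns=['property'], aggfunc=sum).fillna(0)
#          price
# property  boat  dog  house
# name
# Bob        0.0  5.0    4.0
# Josh       0.0  2.0    0.0
# Sam        3.0  0.0    0.0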
