How do I merge similar data such as "recommendation" into one value?
df['Why you choose us'].str.lower().value_counts()
location 35
recommendation 23
recommedation 8
confort 7
availability 4
reconmmendation 3
facilities 3
print(df)
reason count
0 location 35
1 recommendation 23
2 recommedation 8
3 confort 7
4 availability 4
5 reconmmendation 3
6 facilities 3
You can use .groupby() on a partial string (a common prefix) together with .transform() to find the sum:
df['groupcount']=df.groupby(df.reason.str[0:4])['count'].transform('sum')
reason count groupcount
0 location 35 35
1 recommendation 23 34
2 recommedation 8 34
3 confort 7 7
4 availability 4 4
5 reconmmendation 3 34
6 facilities 3 3
If you need to see the string and the partial string side by side, try:
df=df.assign(groupname=df.reason.str[0:4])
df['groupcount']=df.groupby(df.reason.str[0:4])['count'].transform('sum')
print(df)
reason count groupname groupcount
0 location 35 loca 35
1 recommendation 23 reco 34
2 recommedation 8 reco 34
3 confort 7 conf 7
4 availability 4 avai 4
5 reconmmendation 3 reco 34
6 facilities 3 faci 3
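Putting this together, a minimal self-contained sketch (the counts are hard-coded from the value_counts output above, so the frame itself is only illustrative):

```python
import pandas as pd

# Counts hard-coded from the value_counts output above
df = pd.DataFrame({
    'reason': ['location', 'recommendation', 'recommedation', 'confort',
               'availability', 'reconmmendation', 'facilities'],
    'count': [35, 23, 8, 7, 4, 3, 3],
})

# Group on the first four characters so the misspellings of
# "recommendation" fall into the same bucket ('reco')
df['groupname'] = df['reason'].str[0:4]
df['groupcount'] = df.groupby('groupname')['count'].transform('sum')
print(df)
```

Note the four-character prefix is only a heuristic: it happens to merge the three spellings of "recommendation" here, but it would not catch, say, "confort" vs "comfort" ('conf' vs 'comf').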
In case you have multiple items in a row, like you have in the csv, then:
#Read the csv
df = pd.read_csv(r'path')
#Lowercase, fill missing answers, and split each 'Why you choose us' cell into a list
df['Why you choose us'] = (df['Why you choose us'].str.lower()
                           .fillna('no comment given')).str.split(',')
#Explode so each reason gets its own row, with the other attributes intact
df = df.explode('Why you choose us')
#Strip whitespace around the values, then count
print(df['Why you choose us'].str.strip().value_counts())
location 48
no comment given 34
recommendation 25
confort 8
facilities 8
recommedation 8
price 7
availability 6
reputation 5
reconmmendation 3
internet 3
ac 3
breakfast 3
tranquility 2
cleanliness 2
aveilable 1
costumer service 1
pool 1
comfort 1
search engine 1
Name: Why you choose us, dtype: int64
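For a self-contained run without the CSV, the same pipeline can be tried on a few hypothetical rows (the data below is made up for illustration; only the split/explode/strip steps come from the snippet above):

```python
import pandas as pd

# Hypothetical survey rows standing in for the real CSV
df = pd.DataFrame({'Why you choose us': ['Location, Price', 'location',
                                         None, ' Recommendation ,location']})

# Lowercase, fill missing answers, split multi-value cells into lists
df['Why you choose us'] = (df['Why you choose us']
                           .str.lower()
                           .fillna('no comment given')
                           .str.split(','))

# One reason per row, other columns repeated
df = df.explode('Why you choose us')

# Strip stray spaces before counting
counts = df['Why you choose us'].str.strip().value_counts()
print(counts)
```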
id zone price
0 0000001 1 33.0
1 0000001 2 24.0
2 0000001 3 34.0
3 0000001 4 45.0
4 0000001 5 51.0
I have the above pandas dataframe. There are multiple ids (only one id is shown here); each id has 5 zones with 5 prices. The prices should follow the pattern
p1 (price of zone 1) < p2 < p3 < p4 < p5
If anything is out of order, we should identify the anomalous records and print them to a file.
In this example p3 < p4 < p5 holds, but p1 and p2 are erroneous (p1 > p2, whereas p1 < p2 is expected),
so the first two records should be printed to a file.
This has to be done over the entire dataframe for every unique id in it.
My dataframe is huge; what is the most efficient way to do this filtering and identify the erroneous records?
You can compute the diff per group after sorting the values to ensure the zones are increasing. If the diff is ≤ 0 the price is not strictly increasing and the rows should be flagged:
s = (df.sort_values(by=['id', 'zone'])  # sort rows
       .groupby('id')                   # group by id
       ['price'].diff()                 # compute the diff
       .le(0)                           # flag those ≤ 0 (not increasing)
     )
df[s | s.shift(-1)]                     # slice flagged rows + the previous row
Example output:
id zone price
0 1 1 33.0
1 1 2 24.0
Example input:
id zone price
0 1 1 33.0
1 1 2 24.0
2 1 3 34.0
3 1 4 45.0
4 1 5 51.0
5 2 1 20.0
6 2 2 24.0
7 2 3 34.0
8 2 4 45.0
9 2 5 51.0
Saving to a file:
df[s|s.shift(-1)].to_csv('incorrect_prices.csv')
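For reference, a self-contained version of this approach built on the example input above (here shift(-1, fill_value=False) is used so the mask stays boolean; the logic is otherwise the same):

```python
import pandas as pd

# The example input above, hard-coded
df = pd.DataFrame({'id': [1] * 5 + [2] * 5,
                   'zone': list(range(1, 6)) * 2,
                   'price': [33.0, 24.0, 34.0, 45.0, 51.0,
                             20.0, 24.0, 34.0, 45.0, 51.0]})

s = (df.sort_values(by=['id', 'zone'])  # sort rows
       .groupby('id')                   # group by id
       ['price'].diff()                 # price change within each id
       .le(0))                          # True where not strictly increasing
bad = df[s | s.shift(-1, fill_value=False)]  # flagged rows plus the row before
print(bad)
```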
Another way is to first sort your dataframe by id and zone in ascending order, then compare each price with the previous one using groupby.shift(), creating a new column. You can then print out the prices that have fallen in value:
import numpy as np
import pandas as pd
df = df.sort_values(by=['id','zone'],ascending=True)
df['increase'] = np.where(df.zone.eq(1),'no change',
np.where(df.groupby('id')['price'].shift(1) < df['price'],'inc','dec'))
>>> df
id zone price increase
0 1 1 33 no change
1 1 2 24 dec
2 1 3 34 inc
3 1 4 45 inc
4 1 5 51 inc
5 2 1 34 no change
6 2 2 56 inc
7 2 3 22 dec
8 2 4 55 inc
9 2 5 77 inc
10 3 1 44 no change
11 3 2 55 inc
12 3 3 44 dec
13 3 4 66 inc
14 3 5 33 dec
>>> df.loc[df.increase.eq('dec')]
id zone price increase
1 1 2 24 dec
7 2 3 22 dec
12 3 3 44 dec
14 3 5 33 dec
I have added some extra IDs to mimic your real data.
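A runnable sketch of this approach (note that sort_values returns a new frame, so its result is assigned back), with made-up prices mimicking the output above:

```python
import numpy as np
import pandas as pd

# Three ids, mimicking the output shown above
df = pd.DataFrame({'id': [1] * 5 + [2] * 5 + [3] * 5,
                   'zone': list(range(1, 6)) * 3,
                   'price': [33, 24, 34, 45, 51,
                             34, 56, 22, 55, 77,
                             44, 55, 44, 66, 33]})
df = df.sort_values(by=['id', 'zone'], ascending=True)

# Zone 1 has no previous zone to compare against; otherwise label the move
df['increase'] = np.where(df.zone.eq(1), 'no change',
                          np.where(df.groupby('id')['price'].shift(1) < df['price'],
                                   'inc', 'dec'))
anomalies = df.loc[df.increase.eq('dec')]
print(anomalies)
```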
I have a dataset in an excel file I'm trying to analyse.
Example data:
Time in s Displacement in mm Force in N
0 0 Not Relevant
1 1 Not Relevant
2 2 Not Relevant
3 3 Not Relevant
4 2 Not Relevant
5 1 Not Relevant
6 0 Not Relevant
7 2 Not Relevant
8 3 Not Relevant
9 4 Not Relevant
10 5 Not Relevant
11 6 Not Relevant
12 5 Not Relevant
13 4 Not Relevant
14 3 Not Relevant
15 2 Not Relevant
16 1 Not Relevant
17 0 Not Relevant
18 4 Not Relevant
19 5 Not Relevant
20 6 Not Relevant
21 7 Not Relevant
22 6 Not Relevant
23 5 Not Relevant
24 4 Not Relevant
24 0 Not Relevant
Imported from an xls file and then plotting a graph of time vs displacement:
import pandas as pd
import matplotlib.pyplot as plt
df = pd.read_excel(
    'DATA.xls',
    engine='xlrd',
    usecols=['Time in s', 'Displacement in mm', 'Force in N'])
fig, ax = plt.subplots()
ax.plot(df['Time in s'], df['Displacement in mm'])
ax.set(xlabel='Time (s)', ylabel='Disp',
       title='time disp')
ax.grid()
fig.savefig("time_disp.png")
plt.show()
I'd like to split the data into multiple groups to analyse separately.
So if I plot displacement against time, I get a sawtooth as a sample is being cyclically loaded.
I'd like to split the data so that each "tooth" is its own group or dataset so I can analyse each cycle
Can anyone help?
You can create a group column whose value changes at each local minimum. First get True at each local minimum, using diff twice, once forward and once backward. Then use cumsum to increase the group number each time a local minimum is found.
df['gr'] = (~(df['Displacement'].diff(1) > 0)
            & ~(df['Displacement'].diff(-1) > 0)).cumsum()
print(df)
Time Displacement gr
0 0 0 1
1 1 1 1
2 2 2 1
3 3 3 1
4 4 2 1
5 5 1 1
6 6 0 2
7 7 2 2
8 8 3 2
9 9 4 2
10 10 5 2
11 11 6 2
12 12 5 2
13 13 4 2
14 14 3 2
15 15 2 2
16 16 1 2
17 17 0 3
18 18 4 3
19 19 5 3
You can split the data by selecting each group individually, or loop over the groups and process each one:
s = (~(df['Displacement'].diff(1) > 0)
     & ~(df['Displacement'].diff(-1) > 0)).cumsum()
for _, dfg in df.groupby(s):
print(dfg)
# analyze as needed
Edit: with the data in your question, where each minimum is exactly 0, df['gr'] = df['Displacement'].eq(0).cumsum() would work as well, but it is specific to the minimum being exactly 0.
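A runnable sketch of the local-minimum grouping on a short made-up sawtooth (column names shortened as in the example above):

```python
import pandas as pd

# A short made-up sawtooth: one full tooth, then a rising ramp
df = pd.DataFrame({'Time': range(12),
                   'Displacement': [0, 1, 2, 3, 2, 1, 0, 2, 3, 4, 5, 6]})

# True at a (non-strict) local minimum: no higher than the previous
# point and no higher than the next one
is_min = (~(df['Displacement'].diff(1) > 0)
          & ~(df['Displacement'].diff(-1) > 0))
df['gr'] = is_min.cumsum()

# Each group number now spans one "tooth"
for g, dfg in df.groupby('gr'):
    print(g, dfg['Displacement'].tolist())
```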
Below is an example of the data set I am working with. I am trying to do a group by on Location for only the top 3 locations based on KGs.
Location KG Dollars
BKK 7 2
BKK 5 3
BKK 4 2
BKK 3 3
BKK 2 2
HKG 8 3
HKG 6 2
HKG 4 3
HKG 3 2
HKG 2 3
The output would look like below. Grouped on the location, summing both KG and Dollars for the top 3 KG records for each location.
Location KG Dollars
BKK 16 7
HKG 18 8
I've tried different kinds of groupby, but I'm having trouble restricting it to the top n KG records per group.
You could do
In [610]: df.groupby('Location').apply(
     ...:     lambda x: x.nlargest(3, 'KG')[['KG', 'Dollars']].sum())
Out[610]:
KG Dollars
Location
BKK 16 7
HKG 18 8
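If apply feels slow on a large frame, an equivalent route is to keep the top-3 KG rows per location first and then aggregate; a sketch on the same sample data:

```python
import pandas as pd

df = pd.DataFrame({'Location': ['BKK'] * 5 + ['HKG'] * 5,
                   'KG': [7, 5, 4, 3, 2, 8, 6, 4, 3, 2],
                   'Dollars': [2, 3, 2, 3, 2, 3, 2, 3, 2, 3]})

top3 = (df.sort_values('KG', ascending=False)  # biggest KG first
          .groupby('Location')
          .head(3))                            # keep the top-3 rows per location
out = top3.groupby('Location')[['KG', 'Dollars']].sum()
print(out)
```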
I'm playing around with the Titanic dataset, and what I'd like to do is fill in all the NaN/null values of the Age column with the median value based on that Pclass.
Here is some data:
train
PassengerId Pclass Age
0 1 3 22
1 2 1 35
2 3 3 26
3 4 1 35
4 5 3 35
5 6 1 NaN
6 7 1 54
7 8 3 2
8 9 3 27
9 10 2 14
10 11 1 NaN
Here is what I would like to end up with:
PassengerId Pclass Age
0 1 3 22
1 2 1 35
2 3 3 26
3 4 1 35
4 5 3 35
5 6 1 35
6 7 1 54
7 8 3 2
8 9 3 27
9 10 2 14
10 11 1 35
The first thing I came up with is this - In the interest of brevity I have only included one slice for Pclass equal to 1 rather than including 2 and 3:
Pclass_1 = train['Pclass']==1
train[Pclass_1]['Age'].fillna(train[train['Pclass']==1]['Age'].median(), inplace=True)
As far as I understand, this method creates a view rather than editing train itself (I don't quite understand how a view differs from a copy, or whether they are analogous in terms of memory; that is an aside I would love to hear about if possible). I particularly like this Q/A on the topic, View vs Copy, How Do I Tell?, but it doesn't include the insight I'm looking for.
Looking through Pandas docs I learned why you want to use .loc to avoid this pitfall. However I just can't seem to get the syntax right.
Pclass_1 = train.loc[:,['Pclass']==1]
Pclass_1.Age.fillna(train[train['Pclass']==1]['Age'].median(),inplace=True)
I'm getting lost in the indices. This one ends up looking for a column named False, which obviously doesn't exist. I don't know how to do this without chained indexing; train.loc[:,train['Pclass']==1] raises IndexingError: Unalignable boolean Series key provided.
In this part of the line,
train.loc[:,['Pclass']==1]
the expression ['Pclass'] == 1 compares the list ['Pclass'] to the value 1, which returns False. The .loc[] is then evaluated as .loc[:, False], which causes the error.
I think you mean:
train.loc[train['Pclass']==1]
which selects all of the rows where Pclass is 1. This fixes the error, but it will still give you the "SettingWithCopyWarning".
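For completeness, a sketch of what a single-.loc assignment looks like for Pclass 1 only (the three-row frame below is made up for illustration):

```python
import numpy as np
import pandas as pd

# A made-up three-row frame for illustration
train = pd.DataFrame({'PassengerId': [1, 2, 3],
                      'Pclass': [1, 1, 3],
                      'Age': [35.0, np.nan, 22.0]})

mask = train['Pclass'] == 1
# Row mask and column name in a single .loc call: no chained
# indexing, so the assignment hits train itself
train.loc[mask, 'Age'] = train.loc[mask, 'Age'].fillna(
    train.loc[mask, 'Age'].median())
print(train)
```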
EDIT 1
(old code removed)
Here is an approach that uses groupby with transform to create a Series
containing the median Age for each Pclass. The Series is then used as the argument to fillna() to replace the missing values with the median. Using this approach will correct all passenger classes at the same time, which is what the OP originally requested. The solution comes from the answer to Python-pandas Replace NA with the median or mean of a group in dataframe
import pandas as pd
from io import StringIO
tbl = """PassengerId Pclass Age
0 1 3 22
1 2 1 35
2 3 3 26
3 4 1 35
4 5 3 35
5 6 1
6 7 1 54
7 8 3 2
8 9 3 27
9 10 2 14
10 11 1
"""
train = pd.read_table(StringIO(tbl), sep='\s+')
print('Original:\n', train)
median_age = train.groupby('Pclass')['Age'].transform('median') #median Ages for all groups
train['Age'].fillna(median_age, inplace=True)
print('\nNaNs replaced with median:\n', train)
The code produces:
Original:
PassengerId Pclass Age
0 1 3 22.0
1 2 1 35.0
2 3 3 26.0
3 4 1 35.0
4 5 3 35.0
5 6 1 NaN
6 7 1 54.0
7 8 3 2.0
8 9 3 27.0
9 10 2 14.0
10 11 1 NaN
NaNs replaced with median:
PassengerId Pclass Age
0 1 3 22.0
1 2 1 35.0
2 3 3 26.0
3 4 1 35.0
4 5 3 35.0
5 6 1 35.0
6 7 1 54.0
7 8 3 2.0
8 9 3 27.0
9 10 2 14.0
10 11 1 35.0
One thing to note is that this line, which uses inplace=True:
train['Age'].fillna(median_age, inplace=True)
can be replaced with assignment using .loc:
train.loc[:,'Age'] = train['Age'].fillna(median_age)
I want to make a new column of the 5-day return for a stock, say. I am using a pandas dataframe. I computed a moving average using the rolling_mean function, but I'm not sure how to reference rows the way I would in a spreadsheet (B6-B1, for example). Does anyone know how I can do this index reference and subtraction?
sample data frame:
day price 5-day-return
1 10 -
2 11 -
3 15 -
4 14 -
5 12 -
6 18 I want to find this: (day 6 price) - (day 1 price)
7 20 then continue this down the list
8 19
9 21
10 22
Are you wanting this:
In [10]:
df['5-day-return'] = (df['price'] - df['price'].shift(5)).fillna(0)
df
Out[10]:
day price 5-day-return
0 1 10 0
1 2 11 0
2 3 15 0
3 4 14 0
4 5 12 0
5 6 18 8
6 7 20 9
7 8 19 4
8 9 21 7
9 10 22 10
shift returns the row at a specific offset; we use it to subtract the shifted value from the current row. fillna fills the NaN values that occur before the first valid calculation.
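A compact, self-contained version using the sample data from the question:

```python
import pandas as pd

# The sample frame from the question
df = pd.DataFrame({'day': range(1, 11),
                   'price': [10, 11, 15, 14, 12, 18, 20, 19, 21, 22]})

# Today's price minus the price five rows earlier; the first five
# rows have no earlier price, so fill the resulting NaNs with 0
df['5-day-return'] = (df['price'] - df['price'].shift(5)).fillna(0)
print(df)
```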