Remove specific characters in dataframe - python

I have a dataframe as below.
Example 1
In some dataframes the tagname column always has 9 characters before the last full stop '.':
| id | tagname | datetime |
|----|---------------------------------|-------------------------|
| 1 | A2.WAM_A.ACCESS_CLOSE_FAULT_LH | 2022-01-20 15:07:36.310 |
| 2 | A2.WAM_A.ACCESS_SENSOR_FAULT_RH | 2022-01-20 15:07:36.310 |
| 3 | A2.WAM_A.OUTPUT_POWER_CP_FAULT | 2022-01-20 15:07:36.310 |
If I use,
df['tagname'] = df['tagname'].str[9:]
Output:
| id | tagname | datetime |
|----|------------------------|-------------------------|
| 1 | ACCESS_CLOSE_FAULT_LH | 2022-01-20 15:07:36.310 |
| 2 | ACCESS_SENSOR_FAULT_RH | 2022-01-20 15:07:36.310 |
| 3 | OUTPUT_POWER_CP_FAULT | 2022-01-20 15:07:36.310 |
Example 2
But in some tables the tagname has a varying number of characters and multiple dots before the last full stop '.', like below:
| id | tagname | datetime |
|----|----------------------------------------------|-------------------------|
| 1 | A2.AC.CH.CONDITION.1ST_VACUUM_CH | 2021-09-28 17:31:48.191 |
| 2 | A2.AC.CH.CONDITION.SMALL_LEAK_TEST_VACUUM_CH | 2021-09-28 17:31:48.193 |
| 3 | A2.AC.CH.CONDITION.VACC_VALUE_CH_FLOAT_R270 | 2021-09-28 17:31:48.196 |
| 4 | A2.CP01.PRL2_TRIM1.ONCHANGE.PRL2 | 2021-09-28 17:31:48.199 |
| 5 | AY2.CP01.DL5_TRIM1.ONCHANGE.DL5 | 2021-09-28 17:31:48.199 |
Required output:
Remove all characters in the tagname column up to and including the last dot '.':
| id | tagname | datetime |
|----|---------------------------|-------------------------|
| 1 | 1ST_VACUUM_CH | 2021-09-28 17:31:48.191 |
| 2 | SMALL_LEAK_TEST_VACUUM_CH | 2021-09-28 17:31:48.193 |
| 3 | VACC_VALUE_CH_FLOAT_R270 | 2021-09-28 17:31:48.196 |
| 4 | PRL2 | 2021-09-28 17:31:48.199 |
| 5 | DL5 | 2021-09-28 17:31:48.199 |

Use str.rsplit:
Example 1:
df1['tagname'] = df1['tagname'].str.rsplit('.', n=1).str[1]
print(df1)
# Output
id tagname datetime
0 1 ACCESS_CLOSE_FAULT_LH 2022-01-20 15:07:36.310
1 2 ACCESS_SENSOR_FAULT_RH 2022-01-20 15:07:36.310
2 3 OUTPUT_POWER_CP_FAULT 2022-01-20 15:07:36.310
Example 2:
df2['tagname'] = df2['tagname'].str.rsplit('.', n=1).str[1]
print(df2)
# Output
id tagname datetime
0 1 1ST_VACUUM_CH 2021-09-28 17:31:48.191
1 2 SMALL_LEAK_TEST_VACUUM_CH 2021-09-28 17:31:48.193
2 3 VACC_VALUE_CH_FLOAT_R270 2021-09-28 17:31:48.196
3 4 PRL2 2021-09-28 17:31:48.199
4 5 DL5 2021-09-28 17:31:48.199
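If you prefer a regular expression, a minimal sketch (assuming tagname is a plain string column) that strips everything up to and including the last dot:
# Greedy .* consumes characters up to the last '.', which is then removed as well
df['tagname'] = df['tagname'].str.replace(r'^.*\.', '', regex=True)
Strings without any dot are left unchanged by this version, whereas rsplit('.', n=1).str[1] would return NaN for them.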

Related

How do I make a grouping that includes only values from the list?

I have a table presented below.
+------+-------+--------------+------------------------------+------+
| good | store | date_id | map_dates | sale |
+------+-------+--------------+------------------------------+------+
| 1 | 2 | '2019-01-01' | ['2018-07-08'] | 10 |
+------+-------+--------------+------------------------------+------+
| 1 | 2 | '2019-05-06' | ['2019-01-01', '2018-07-08'] | 5 |
+------+-------+--------------+------------------------------+------+
| 5 | 4 | '2019-10-12' | ['2018-12-01'] | 24 |
+------+-------+--------------+------------------------------+------+
| 1 | 2 | '2018-07-08' | [] | 3 |
+------+-------+--------------+------------------------------+------+
| 5 | 4 | '2018-12-01' | [] | 15 |
+------+-------+--------------+------------------------------+------+
I want to group by columns good, store, and include only the dates specified in the map_dates column in the result. For example:
+------+-------+--------------+----------+
| good | store | date_id | sum_sale |
+------+-------+--------------+----------+
| 1 | 2 | '2019-01-01' | 3 |
+------+-------+--------------+----------+
| 1 | 2 | '2019-05-06' | 13 |
+------+-------+--------------+----------+
| 5 | 4 | '2019-10-12' | 15 |
+------+-------+--------------+----------+
How can I do this without using a loop?
First we explode, then we match our values by an inner merge on good, store, map_dates and date_id. Finally we GroupBy.sum:
dfn = df.explode('map_dates')
dfn = dfn.merge(dfn,
                left_on=['good', 'store', 'map_dates'],
                right_on=['good', 'store', 'date_id'],
                suffixes=['', '_sum'])
dfn = dfn.groupby(['good', 'store', 'date_id'])['sale_sum'].sum().reset_index()
print(dfn)
good store date_id sale_sum
0 1 2 2019-01-01 3
1 1 2 2019-05-06 13
2 5 4 2019-10-12 15
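For completeness, a minimal sketch that rebuilds the question's frame before running the steps above (values copied from the table; note that map_dates must hold real Python lists, not strings):
import pandas as pd

df = pd.DataFrame({
    'good': [1, 1, 5, 1, 5],
    'store': [2, 2, 4, 2, 4],
    'date_id': ['2019-01-01', '2019-05-06', '2019-10-12', '2018-07-08', '2018-12-01'],
    'map_dates': [['2018-07-08'], ['2019-01-01', '2018-07-08'], ['2018-12-01'], [], []],
    'sale': [10, 5, 24, 3, 15],
})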

Can we alter pandas cross tabulation?

I have loaded raw_data from MySQL using sqlalchemy and pymysql
engine = create_engine('mysql+pymysql://[user]:[passwd]@[host]:[port]/[database]')
df = pd.read_sql_table('data', engine)
df is something like this
| Age Category | Category |
|--------------|----------------|
| 31-26 | Engaged |
| 26-31 | Engaged |
| 31-36 | Not Engaged |
| Above 51 | Engaged |
| 41-46 | Disengaged |
| 46-51 | Nearly Engaged |
| 26-31 | Disengaged |
Then I performed the analysis as follows:
age = pd.crosstab(df['Age Category'], df['Category'])
| Category | A | B | C | D |
|--------------|---|----|----|---|
| Age Category | | | | |
| 21-26 | 2 | 2 | 4 | 1 |
| 26-31 | 7 | 11 | 12 | 5 |
| 31-36 | 3 | 5 | 5 | 2 |
| 36-41 | 2 | 4 | 1 | 7 |
| 41-46 | 0 | 1 | 3 | 2 |
| 46-51 | 0 | 0 | 2 | 3 |
| Above 51 | 0 | 3 | 0 | 6 |
I want to change it to a pandas DataFrame like this:
| Age Category | A | B | C | D |
|--------------|---|----|----|---|
| 21-26 | 2 | 2 | 4 | 1 |
| 26-31 | 7 | 11 | 12 | 5 |
| 31-36 | 3 | 5 | 5 | 2 |
| 36-41 | 2 | 4 | 1 | 7 |
| 41-46 | 0 | 1 | 3 | 2 |
| 46-51 | 0 | 0 | 2 | 3 |
| Above 51 | 0 | 3 | 0 | 6 |
Thank you for your time and consideration
Both labels are the columns name and the index name; the solution to change them is DataFrame.rename_axis:
age = age.rename_axis(index=None, columns='Age Category')
Or set the columns name from the index name, and then reset the index name to the default, None:
age.columns.name = age.index.name
age.index.name = None
print (age)
Age Category Disengaged Engaged Nearly Engaged Not Engaged
26-31 1 1 0 0
31-26 0 1 0 0
31-36 0 0 0 1
41-46 1 0 0 0
46-51 0 0 1 0
Above 51 0 1 0 0
But these labels are essentially metadata, so some functions may drop them.
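If you actually want Age Category as an ordinary column rather than as the index, a sketch using DataFrame.reset_index (this changes the shape, not just the labels):
age = pd.crosstab(df['Age Category'], df['Category']).reset_index()
age.columns.name = None  # drop the leftover 'Category' columns name from crosstab
print(age)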

I have obtained two other dataframes from my original dataframe; how can I merge the columns I need into a final one?

I have a table with 4 columns. From this data I obtained another 2 tables with some rolling averages of the original table. Now I want to combine these 3 into a final table, but the indexes are no longer in order and I can't do it. I just started to learn Python, I have zero experience, and I would really appreciate any help I can get.
DF
+----+------------+-----------+------+------+
| | A | B | C | D |
+----+------------+-----------+------+------+
| 1 | Home Team | Away Team | Htgs | Atgs |
| 2 | dalboset | sopot | 1 | 2 |
| 3 | calnic | resita | 1 | 3 |
| 4 | sopot | dalboset | 2 | 2 |
| 5 | resita | sopot | 4 | 1 |
| 6 | sopot | dalboset | 2 | 1 |
| 7 | caransebes | dalboset | 1 | 2 |
| 8 | calnic | resita | 1 | 3 |
| 9 | dalboset | sopot | 2 | 2 |
| 10 | calnic | resita | 4 | 1 |
| 11 | sopot | dalboset | 2 | 1 |
| 12 | resita | sopot | 1 | 2 |
| 13 | sopot | dalboset | 1 | 3 |
| 14 | caransebes | dalboset | 2 | 2 |
| 15 | calnic | resita | 4 | 1 |
| 16 | dalboset | sopot | 2 | 1 |
| 17 | calnic | resita | 1 | 2 |
| 18 | sopot | dalboset | 4 | 1 |
| 19 | resita | sopot | 2 | 1 |
| 20 | sopot | dalboset | 1 | 2 |
| 21 | caransebes | dalboset | 1 | 3 |
| 22 | calnic | resita | 2 | 2 |
+----+------------+-----------+------+------+
CODE
df1 = df.groupby('Home Team')[['Htgs', 'Atgs']].rolling(window=4, min_periods=3).mean()
df1 = df1.rename(columns={'Htgs': 'Htgs/3', 'Atgs': 'Htgc/3'})
df1
df2 = df.groupby('Away Team')[['Htgs', 'Atgs']].rolling(window=4, min_periods=3).mean()
df2 = df2.rename(columns={'Htgs': 'Atgc/3', 'Atgs': 'Atgs/3'})
df2
Now I need a way to see the rolling-average columns next to the Home Team, Away Team, Htgs and Atgs columns from the original table.
Done!
I create the new column directly in the dataframe like this:
df = pd.read_csv('Fd.csv')
df['Htgs/3'] = df.groupby('Home Team')['Htgs'].rolling(window=4, min_periods=3).mean().reset_index(0, drop=True)
Htgs/3 is the new column with the rolling average of Htgs within each Home Team group; for the rest I do the same, as in the sketch below.
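Following the same pattern, a sketch for the remaining rolling columns (the Htgc/3, Atgs/3 and Atgc/3 names are taken from the rename calls in the question):
# Goals conceded at home: roll Atgs within each Home Team group
df['Htgc/3'] = df.groupby('Home Team')['Atgs'].rolling(window=4, min_periods=3).mean().reset_index(0, drop=True)
# Goals scored / conceded away: roll within each Away Team group
df['Atgs/3'] = df.groupby('Away Team')['Atgs'].rolling(window=4, min_periods=3).mean().reset_index(0, drop=True)
df['Atgc/3'] = df.groupby('Away Team')['Htgs'].rolling(window=4, min_periods=3).mean().reset_index(0, drop=True)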

Pandas merge two dataframe and drop extra rows

How can I merge/join these two dataframes ONLY on "sample_id" and drop the extra rows from the second dataframe when merging/joining?
Using pandas in Python.
First dataframe (fdf)
| sample_id | name |
|-----------|-------|
| 1 | Mark |
| 1 | Dart |
| 2 | Julia |
| 2 | Oolia |
| 2 | Talia |
Second dataframe (sdf)
| sample_id | salary | time |
|-----------|--------|------|
| 1 | 20 | 0 |
| 1 | 30 | 5 |
| 1 | 40 | 10 |
| 1 | 50 | 15 |
| 2 | 33 | 0 |
| 2 | 23 | 5 |
| 2 | 24 | 10 |
| 2 | 28 | 15 |
| 2 | 29 | 20 |
So the resulting df will be like -
| sample_id | name | salary | time |
|-----------|-------|--------|------|
| 1 | Mark | 20 | 0 |
| 1 | Dart | 30 | 5 |
| 2 | Julia | 33 | 0 |
| 2 | Oolia | 23 | 5 |
| 2 | Talia | 24 | 10 |
There are duplicates, so you need a helper column for a correct DataFrame.merge; use GroupBy.cumcount as the counter:
df = (fdf.assign(g=fdf.groupby('sample_id').cumcount())
         .merge(sdf.assign(g=sdf.groupby('sample_id').cumcount()),
                on=['sample_id', 'g'])
         .drop('g', axis=1))
print(df)
sample_id name salary time
0 1 Mark 20 0
1 1 Dart 30 5
2 2 Julia 33 0
3 2 Oolia 23 5
4 2 Talia 24 10
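To see why the helper column works, a small sketch printing the counter on fdf: g simply numbers the repeated sample_id values, so the rows of fdf and sdf are paired by position within each sample_id.
print(fdf.assign(g=fdf.groupby('sample_id').cumcount()))
#    sample_id   name  g
# 0          1   Mark  0
# 1          1   Dart  1
# 2          2  Julia  0
# 3          2  Oolia  1
# 4          2  Talia  2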
An alternative is a plain left merge followed by sorting and dropping duplicates per (sample_id, name), using the question's fdf and sdf:
final_res = pd.merge(fdf, sdf, on=['sample_id'], how='left')
final_res.sort_values(['sample_id', 'name', 'time'], ascending=[True, True, True], inplace=True)
final_res.drop_duplicates(subset=['sample_id', 'name'], keep='first', inplace=True)

Splitting a Graphlab SFrame Date column into three columns (Year Month Day)

Given a graphlab SFrame where there's a column with dates, e.g.:
+-------+------------+---------+-----------+
| Store | Date | Sales | Customers |
+-------+------------+---------+-----------+
| 1 | 2015-07-31 | 5263.0 | 555.0 |
| 2 | 2015-07-31 | 6064.0 | 625.0 |
| 3 | 2015-07-31 | 8314.0 | 821.0 |
| 4 | 2015-07-31 | 13995.0 | 1498.0 |
| 3 | 2015-07-20 | 4822.0 | 559.0 |
| 2 | 2015-07-10 | 5651.0 | 589.0 |
| 4 | 2015-07-11 | 15344.0 | 1414.0 |
| 5 | 2015-07-23 | 8492.0 | 833.0 |
| 2 | 2015-07-19 | 8565.0 | 687.0 |
| 10 | 2015-07-09 | 7185.0 | 681.0 |
+-------+------------+---------+-----------+
[986159 rows x 4 columns]
Is there an easy way in graphlab / other python function to convert the Date column to Year|Month|Day?
+-------+------+----+----+---------+-----------+
| Store | YYYY | MM | DD | Sales | Customers |
+-------+------+----+----+---------+-----------+
| 1 | 2015 | 07 | 31 | 5263.0 | 555.0 |
| 2 | 2015 | 07 | 31 | 6064.0 | 625.0 |
| 3 | 2015 | 07 | 31 | 8314.0 | 821.0 |
+-------+------+----+----+---------+-----------+
[986159 rows x 6 columns]
In pandas I could follow "Which is the fastest way to extract day, month and year from a given date?", but converting the SFrame to pandas just to split the date and then back to an SFrame is quite a chore.
You could also do it with the split_datetime method. It gives you a bit more flexibility.
sf.add_columns(sf['Date'].split_datetime(column_name_prefix = ''))
The split_datetime method itself is on the SArray (a single column of the SFrame) and it returns an SFrame which you can then add back to the original data (at basically 0 cost)
A quick and dirty way to do this is
sf['date2'] = sf['Date'].apply(lambda x: x.split('-'))
sf = sf.unpack('date2')
Another option would be to convert the Date column to a datetime type, then use the graphlab.SArray.split_datetime function.
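A sketch of that last option, assuming the graphlab SArray.str_to_datetime and SArray.split_datetime signatures (the exact names may differ between graphlab and turicreate releases):
# Parse the string column into datetimes, then split it into separate columns
sf['Date'] = sf['Date'].str_to_datetime('%Y-%m-%d')
sf = sf.add_columns(sf['Date'].split_datetime(column_name_prefix='',
                                              limit=['year', 'month', 'day']))
The new columns come out named year, month and day; rename them if you need the YYYY/MM/DD labels from the question.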
