Extract text after period "." from values in a column in Pandas Dataframes


I have a column in a dataframe as follows:
| Category |
------------
| B5050.88
| 5051.90
| B5050.97Q
| 5051.23B
| 5051.78E
| B5050.11
| 5051.09
| Z5052
I want to extract the text after the period. For example, from B5050.88, I want only "88"; from 5051.78E, I want only "78E"; for Z5052, it would be nothing as there's no period.
Expected output:
| Category | Digits |
---------------------
| B5050.88 | 88 |
| 5051.90 | 90 |
| B5050.97Q| 97Q |
| 5051.23B | 23B |
| 5051.78E | 78E |
| B5050.11 | 11 |
| 5051.09 | 09 |
| Z5052 | - |
I tried using this
df['Digits'] = df.Category.str.extract('.(.*)')
But I'm not getting the right answer. Using the above, for B5050.88, I'm getting the same B5050.88; for 5051.09, I'm getting NaN. Basically NaN if there's no text.

You can do
splits = [str(p).split(".") for p in df["Category"]]
df["Digits"] = [p[1] if len(p)>1 else "-" for p in splits]
For example:
import pandas as pd
df = pd.DataFrame({"Category": ["5050.88", "5051.90", "B5050.97", "5051.23B", "5051.78E",
                                "B5050.11", "5051.09", "Z5052"]})
#df
# Category
# 0 5050.88
# 1 5051.90
# 2 B5050.97
# 3 5051.23B
# 4 5051.78E
# 5 B5050.11
# 6 5051.09
# 7 Z5052
splits = [str(p).split(".") for p in df["Category"]]
splits
# [['5050', '88'],
# ['5051', '90'],
# ['B5050', '97'],
# ['5051', '23B'],
# ['5051', '78E'],
# ['B5050', '11'],
# ['5051', '09'],
# ['Z5052']]
df["Digits"] = [p[1] if len(p)>1 else "-" for p in splits]
df
# Category Digits
# 0 5050.88 88
# 1 5051.90 90
# 2 B5050.97 97
# 3 5051.23B 23B
# 4 5051.78E 78E
# 5 B5050.11 11
# 6 5051.09 09
# 7 Z5052 -
Not so pretty, but it works.
EDIT:
Added "-" instead of NaN, plus the code snippet.

Another way (a raw string avoids the invalid-escape warning for the pattern):
df.Category.str.split(r'[.]').str[1]
0 88
1 90
2 97Q
3 23B
4 78E
5 11
6 09
7 NaN
Alternatively, with a lookbehind and a single capture group (two groups would return two columns):
df.Category.str.extract(r'(?<=[.])(\w+)', expand=False)

You need to escape the first . (unescaped, it matches any character) and call fillna for the rows without a period:
df["Digits"] = df["Category"].astype(str).str.extract(r"\.(.*)").fillna("-")
print(df)
Output:
Category Digits
0 B5050.88 88
1 5051.90 90
2 B5050.97Q 97Q
3 5051.23B 23B
4 5051.78E 78E
5 B5050.11 11
6 5051.09 09
7 Z5052 -
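
To see why the escape matters, here is a minimal sketch using only the standard `re` module on a few of the sample values. The helper name `after_period` is hypothetical and only for illustration; it is not part of any answer above.

```python
import re

samples = ["B5050.88", "5051.78E", "Z5052"]

def after_period(value):
    """Return the text after the first literal period, or "-" if there is none."""
    match = re.search(r"\.(.*)", value)
    return match.group(1) if match else "-"

# Unescaped: "." consumes the first character, the group captures the rest.
unescaped = re.search(".(.*)", "B5050.88").group(1)

# Escaped: "\." only matches a literal period.
escaped = [after_period(s) for s in samples]

print(unescaped)
print(escaped)
```

The same pattern is what `str.extract` applies to each value, so the behaviour carries over to the DataFrame column.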

Try the code below:
df['Category'].apply(lambda x: x.split(".")[-1] if "." in x else "-")
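
A regex-free variant, sketched here under the assumption that only the text after the first period is wanted: `Series.str.partition` splits each value into the parts before, at, and after the separator, and the last part is an empty string when no period is present.

```python
import pandas as pd

df = pd.DataFrame({"Category": ["B5050.88", "5051.90", "B5050.97Q", "5051.23B",
                                "5051.78E", "B5050.11", "5051.09", "Z5052"]})

# partition(".") returns three columns: before (0), separator (1), after (2).
after = df["Category"].astype(str).str.partition(".")[2]

# Values without a period yield "", which is replaced by "-" (exact match).
df["Digits"] = after.replace("", "-")
print(df)
```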

Related

Python - pandas remove duplicate rows based on condition

I have a csv which has data that looks like this
id | code | date
-------------+-----------------------------
| 1 | 2 | 2022-10-05 07:22:39+00::00 |
| 1 | 0 | 2022-11-05 02:22:35+00::00 |
| 2 | 3 | 2021-01-05 10:10:15+00::00 |
| 2 | 0 | 2019-01-11 10:05:21+00::00 |
| 2 | 1 | 2022-01-11 10:05:22+00::00 |
| 3 | 2 | 2022-10-10 11:23:43+00::00 |
I want to remove duplicate ids based on the following conditions:
For the code column, choose the value that is not equal to 0 and, among those, the one with the latest timestamp.
Add another column, prev_code, which contains a list of all the remaining code values for that id that are not present in the code column.
Something like this -
id | code | prev_code
-------------+----------
| 1 | 2 | [0] |
| 2 | 1 | [0,2] |
| 3 | 2 | [] |

There is probably a sleeker solution but something along the following lines should work.
import pandas as pd
df = pd.read_csv('file.csv')
# latest non-zero code per id (Series indexed by (id, original row index))
lastcode = df[df.code!=0].groupby('id').apply(lambda block: block[block['date'] == block['date'].max()]['code'])
# all other codes of each id
prev_codes = df.groupby('id').agg(code=('code', lambda x: [val for val in x if val != lastcode[x.name].values[0]]))['code']
pd.DataFrame({'id': [idx[0] for idx in lastcode.index], 'code': lastcode.values, 'prev_code': prev_codes.values})
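
One possible alternative, sketched with the sample rows typed in by hand (timezone suffixes dropped for brevity) and assuming every id has at least one non-zero code. It sorts by date, takes the last non-zero code per id, and collects the remaining codes with a list comprehension. With the rows exactly as shown in the question, id 2 yields [3, 0].

```python
import pandas as pd

df = pd.DataFrame({
    "id":   [1, 1, 2, 2, 2, 3],
    "code": [2, 0, 3, 0, 1, 2],
    "date": pd.to_datetime([
        "2022-10-05 07:22:39", "2022-11-05 02:22:35",
        "2021-01-05 10:10:15", "2019-01-11 10:05:21",
        "2022-01-11 10:05:22", "2022-10-10 11:23:43",
    ]),
})

# Latest non-zero code per id: filter, sort by date, keep the last row per group.
latest = (df[df["code"] != 0]
          .sort_values("date")
          .groupby("id")["code"]
          .last())

# Every code seen per id, in original row order.
all_codes = df.groupby("id")["code"].agg(list)

result = pd.DataFrame({
    "id": latest.index,
    "code": latest.to_numpy(),
    "prev_code": [[c for c in all_codes.loc[i] if c != latest.loc[i]]
                  for i in latest.index],
})
print(result)
```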

Group by sum date and fill all missing values with excess from past dates until count = 1

I have a dataframe with a grouped date and a count. If there is any gap in this time series, I have to fill it with the excess from previous stacks; if there are no gaps, the series is extended until all counts = 1.
These examples all happen within the same month.
NOTE: day_date is a timestamp with daily frequency where missing values are 0; I used integers for simplicity in the examples.
An example with missing gaps but no previous stacks:
| day_date | stack |
| -------- | ----- |
| 1 | 0 |
| 2 | 2 |
Produces
| day_date | stack |
| -------- | ----- |
| 1 | 0 |
| 2 | 1 | #
| 3 | 1 | # The entire period flattens to a daily frequency with value = 1
An example of days being over stacked and filling gaps:
| day_date | stack |
| -------- | ----- |
| 1 | 0 |
| 2 | 2 | # this row won't be able to fill until the 6th
| 6 | 3 | # this row and below will create overlap
| 8 | 2 |
| 15 | 1 | # there is a big gap here that will get filled as much as possible from previous overlap
Produces:
| day_date | stack |
| -------- | ----- |
| 1 | 0 |
| 2 | 1 |
| 3 | 1 |
| 4 | 0 | # the previous stack covered only until the 3rd.
| 5 | 0 |
| 6 | 1 |
| 7 | 1 |
| 8 | 1 | # Here the last stack from the 6th overlaps with the 2 days from the 8th; since the day is already covered by the past stack, the two days from the 8th move forward to fill gaps.
| 9 | 1 | # there is a big gap here that gets filled as much as possible by the overlap from the 8th, which is 2 days that fill the 9th and 10th.
| 10 | 1 |
| 11 | 0 |
| 12 | 0 |
| 13 | 0 |
| 14 | 0 |
| 15 | 1 | #last stack.
Note that the 9th and 10th have a 1 because of the excess from the 8th, which was itself already covered by the big refill on the 6th that covered the 6th to the 8th.

EDIT: using timestamps
Maybe a more readable solution (for beginners) using for loops and a bunch of if statements:
import pandas as pd

lst = [[pd.Timestamp(year=2017, month=1, day=1), 0],
       [pd.Timestamp(year=2017, month=1, day=2), 2],
       [pd.Timestamp(year=2017, month=1, day=10), 3],
       [pd.Timestamp(year=2017, month=2, day=1), 2],
       [pd.Timestamp(year=2017, month=2, day=3), 2]]
df = pd.DataFrame(lst, columns=['day_date', 'stack'])
n_days = (df.day_date.max() - df.day_date.min()).days + 1
stack = 0
for index in range(n_days):
    stack += df.loc[index, 'stack']
    # insert new day
    if index + 1 < len(df):  # if you are not at the end of the dataframe
        next_day = df.loc[index+1].day_date  # compute the next day in dataframe
        this_day = df.loc[index].day_date  # compute this day
        if df.loc[index, 'stack'] >= 1:
            df.loc[index, 'stack'] = 1
            stack -= 1
        if this_day + pd.DateOffset(1) != next_day:  # if there is a gap in days
            for new_day in range(1, (next_day - this_day).days):
                if stack > 0:
                    df.loc[len(df)] = [this_day + pd.DateOffset(new_day), 1]
                    stack -= 1
                else:
                    df.loc[len(df)] = [this_day + pd.DateOffset(new_day), 0]
            df = df.sort_values('day_date').reset_index(drop=True)
    else:
        if df.loc[index, 'stack'] >= 1:
            df.loc[index, 'stack'] = 1
            stack -= 1
# extend the series while there is excess left
while stack >= 1:
    this_day = df.loc[len(df)-1].day_date
    df.loc[len(df)] = [this_day + pd.DateOffset(1), 1]
    stack -= 1

This is not such an easy task (if needed to perform in a vectorial way).
You can calculate first the remainder days to carry them to the next date, then use reindexing to duplicate/fill the rows:
remainder = (df['stack'].add(df['day_date'].diff(-1))
             .fillna(0, downcast='infer').clip(lower=0)
            )
df2 = (df
   # shift extra "stack" to next stack
   .assign(stack=df['stack'].sub(remainder).add(remainder.shift(fill_value=0)))
   # repeat rows using "stack" value with a minimum of 1
   .loc[lambda d: d.index.repeat(d['stack'].clip(lower=1))]
   # make stack>1 equal to 1
   # and increment the days per group
   .assign(stack=lambda d: d['stack'].clip(upper=1),
           day_date=lambda d: d['day_date'].add(
               (m:=d['day_date'].duplicated())
               .astype(int)
               .groupby((~m).cumsum())
               .cumsum()
           )
          )
   # fill missing days (all remaining lines)
   .set_index('day_date')
   .reindex(range(df['day_date'].min(), df['day_date'].max()+1))
   .fillna(0, downcast='infer')
   .reset_index()
)
output:
day_date stack
0 1 0
1 2 1
2 3 1
3 4 0
4 5 0
5 6 1
6 7 1
7 8 1
8 9 1
9 10 1
10 11 0
11 12 0
12 13 0
13 14 0
14 15 1
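
The carry-over rule described in the question can also be sketched without pandas, assuming integer days as in the simplified examples. The function name `flatten_stacks` and the dict-based input are hypothetical choices made for this illustration: each day adds its count to a running excess, uses one unit if any is available, and the series keeps extending past the last date while excess remains.

```python
def flatten_stacks(stacks):
    """stacks maps integer day -> count; returns a dict of day -> 0 or 1."""
    last = max(stacks)
    result = {}
    carry = 0          # excess units not yet placed on a day
    day = min(stacks)
    # Walk day by day; continue past the last date while excess remains.
    while day <= last or carry > 0:
        carry += stacks.get(day, 0)
        if carry > 0:
            result[day] = 1
            carry -= 1
        else:
            result[day] = 0
        day += 1
    return result

example = flatten_stacks({1: 0, 2: 2, 6: 3, 8: 2, 15: 1})
print(list(example.values()))
```

For real data the dict could be built from the dataframe, for example with `dict(zip(df['day_date'], df['stack']))` when the days are integers.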

How to find similar GPS coordinates in rows of same column?

Is there a way to identify which GPS coordinates represent the same location? For example, given the following DataFrame, how can I tell that two Ids come from the same source location?
+-----+--------------+-------------+
| Id | VehLat | VehLong |
+-----+--------------+-------------+
| 66 | 63.3917005 | 10.4264724 |
| 286 | 63.429603 | 10.4167367 |
| 61 | 33.6687838 | 73.0755573 |
| 67 | 63.4150316 | 10.3980401 |
| 5 | 64.048128 | 10.083776 |
| 8 | 63.4332386 | 10.3971859 |
| 9 | 63.4305769 | 10.3927124 |
| 6 | 63.4293578 | 10.4164764 |
| 1 | 64.048254 | 10.084230 |
+-----+--------------+-------------+
Now, Ids 5 and 1 are basically the same location, but what's the best approach to classify these two locations as the same?

IIUC, you need this.
df[['VehLat','VehLong']].round(3).duplicated(keep=False)
You can change the number within round to adjust what you consider as "same"
Output
0 False
1 False
2 False
3 False
4 True
5 False
6 False
7 False
8 True
If you want the df itself with duplicate values, you can do as below
df[df[['VehLat','VehLong']].round(2).duplicated(keep=False)]
OR
df.loc[df[['VehLat','VehLong']].round(2).duplicated(keep=False)]
Output
id VehLat VehLong
1 286 63.429603 10.416737
4 5 64.048128 10.083776
7 6 63.429358 10.416476
8 1 64.048254 10.084230

Use DataFrame.sort_values + Series.between.
This gives you greater flexibility when establishing the criteria for
considering two coordinates equivalent:
df2=df[['VehLat','VehLong']].sort_values(['VehLong','VehLat'])
eq=df2.apply(lambda x: x.diff().between(-0.001,0.001)).all(axis=1)
df2[eq|eq.shift(-1)]
this returns a data frame with equivalent coordinates
VehLat VehLong
4 64.048128 10.083776
8 64.048254 10.084230
7 63.429358 10.416476
1 63.429603 10.416737
df2[~(eq|eq.shift(-1))]
this returns unique coordinates
VehLat VehLong
6 63.430577 10.392712
5 63.433239 10.397186
3 63.415032 10.398040
0 63.391700 10.426472
2 33.668784 73.075557
you can restore order using DataFrame.sort_index
df_noteq=df2[~(eq|eq.shift(-1))].sort_index()
print(df_noteq)
VehLat VehLong
0 63.391700 10.426472
2 33.668784 73.075557
3 63.415032 10.398040
5 63.433239 10.397186
6 63.430577 10.392712
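
Both approaches above compare raw degrees, and rounding can separate two nearby points that fall on opposite sides of a rounding boundary. A distance-based check is one alternative. The sketch below uses the haversine formula with a mean Earth radius of 6,371 km; the function names and the 50 m threshold are assumptions chosen for illustration, not something taken from the answers.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in metres

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two (lat, lon) points in degrees."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlmb = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * asin(sqrt(a))

def same_location(p, q, threshold_m=50):
    """Treat two points as the same place if they are within threshold_m metres."""
    return haversine_m(*p, *q) <= threshold_m

# A few rows from the question, keyed by Id.
points = {
    5: (64.048128, 10.083776),
    1: (64.048254, 10.084230),
    66: (63.3917005, 10.4264724),
}

print(same_location(points[5], points[1]))
print(same_location(points[5], points[66]))
```

Comparing every pair this way is quadratic in the number of rows, so for large frames a spatial index or clustering step would be worth considering.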

Barplot comparing two columns

I would like to draw a barplot graph that would compare the evolution of 2 variables of revenues on a monthly time-axis (12 months of invoices).
I wanted to use sns.barplot, but can't use "hue" (cause the 2 variables aren't subcategories?). Is there another way, as simple as with hue? Can I "create" a hue?
Here is a small sample of my data:
(I did transform my table into a pivot table)
[In]
data_pivot['Revenue-Small-Seller-in'] = data_pivot["Small-Seller"] + data_pivot["Best-Seller"] + data_pivot["Medium-Seller"]
data_pivot['Revenue-Not-Small-Seller-in'] = data_pivot["Best-Seller"] + data_pivot["Medium-Seller"]
data_pivot
[Out]
InvoiceNo Month Year Revenue-Small-Seller-in Revenue-Not-Small-Seller-in
536365 12 2010 139.12 139.12
536366 12 2010 22.20 11.10
536367 12 2010 278.73 246.93
(sorry for the ugly presentation of my data, see the picture to see the complete table (as there are multiple columns))

You can do:
import matplotlib.pyplot as plt
import seaborn as sns
render_df = data_pivot[data_pivot.columns[-2:]]
fig, ax = plt.subplots(1,1)
render_df.plot(kind='bar', ax=ax)
ax.legend()
plt.show()
Or, in seaborn style as you requested (pass the arguments by keyword):
render_df = data_pivot[data_pivot.columns[-2:]].stack().reset_index()
sns.barplot(x='level_0', y=0, hue='level_1', data=render_df)
here render_df after stack() is:
+---+---------+-----------------------------+--------+
| | level_0 | level_1 | 0 |
+---+---------+-----------------------------+--------+
| 0 | 0 | Revenue-Small-Seller-in | 139.12 |
| 1 | 0 | Revenue-Not-Small-Seller-in | 139.12 |
| 2 | 1 | Revenue-Small-Seller-in | 22.20 |
| 3 | 1 | Revenue-Not-Small-Seller-in | 11.10 |
| 4 | 2 | Revenue-Small-Seller-in | 278.73 |
| 5 | 2 | Revenue-Not-Small-Seller-in | 246.93 |
+---+---------+-----------------------------+--------+
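
The same long format can be produced with `DataFrame.melt`, which lets you name the resulting columns. This is a minimal sketch using the three sample rows from the question; the names `variable` and `revenue` are arbitrary labels chosen here, and the plotting call is left as a comment because it needs seaborn and a display.

```python
import pandas as pd

data_pivot = pd.DataFrame({
    "InvoiceNo": [536365, 536366, 536367],
    "Revenue-Small-Seller-in": [139.12, 22.20, 278.73],
    "Revenue-Not-Small-Seller-in": [139.12, 11.10, 246.93],
})

# One row per (invoice, variable) pair; the former column name becomes the hue.
long_df = data_pivot.melt(id_vars="InvoiceNo",
                          var_name="variable",
                          value_name="revenue")
print(long_df)

# With seaborn installed, the long frame can then be plotted as:
# sns.barplot(data=long_df, x="InvoiceNo", y="revenue", hue="variable")
```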

Having trouble discovering a way to make a calendar using Python

I need to make a calendar in Python that displays a formatted table of a month when the user inputs the days in the month and the date of the first Sunday. For example, if user inputs 30 days and the first Sunday is 6, then the calendar would look like this.
I have no idea which direction to go when going about this problem. I could use some help on how to approach this question. So far I've only learnt about booleans, conditionals, loops, etc. Also not allowed to import calendar. Thank you.

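Please check if this meets the need (Python 3):
day_count = 30
first_sunday = 6
# start counting from a (possibly negative) offset so that day 1 lands in the right column
if first_sunday != 1:
    i = int(first_sunday) - 7
else:
    i = 1
count = 1
print("Su Mo Tu We Th Fr Sa")
while i <= day_count:
    if i < 1:
        print("%-3s" % " ", end="")  # empty cell before day 1
    else:
        print("%-3s" % str(i), end="")
    if i > 0 and count % 7 == 0:
        print()  # end of the week
    i = i + 1
    count = count + 1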

@Blooper, I've written the code below, in which the function show_formatted_calendar() takes 5 parameters.
The first 2 are required; the last 3 are optional and can be passed to change the look of the table.
The code calls show_formatted_calendar() 3 times.
show_formatted_calendar() returns a formatted calendar string that you can print to the console, save to a file, or store in a variable for reuse.
When you run it, the first call asks you to enter the values of the 2 required parameters and uses the default values for the last 3.
The 2nd and 3rd calls use fixed values for the first 2 parameters and override default values of the optional parameters to change the look of the table, as you can see in the output at the bottom.
Customizing tables using optional parameters
spaces» Number of spaces before and after cell values. Default is 1.
fill_char» Character used to fill the lines between two joints/corners. Default is -.
corner_char» Character used at the joints or corners of the calendar table. Default is +.
Please have a look at the code below (Python 3).
def show_formatted_calendar(no_of_days, sun_starts_from, spaces=1, fill_char='-', corner_char='+'):
    days = ['SUN', 'MON', 'TUE', 'WED', 'THU', 'FRI', 'SAT']
    blank_fields = 0  # Blank fields in first row (may vary in range [0, 6])
    if not (28 <= no_of_days <= 31):
        print('Input Error: Number of days should be in interval [28, 31].')
        return
    if not (1 <= sun_starts_from <= 7):
        print('Input Error: Sunday should be in interval [1, 7].')
        return
    string = fill_char * spaces  # -
    decorator_line = string + 3 * fill_char + string  # -----
    separator_line = (corner_char + decorator_line) * 7 + corner_char  # +-----+-----+-----+-----+-----+-----+-----+
    # First line
    formatted_calendar = separator_line + '\n'
    # Second line
    line_spaces = ' ' * spaces
    days_string = "|" + line_spaces + (line_spaces + '|' + line_spaces).join(days) + line_spaces + '|'
    formatted_calendar += days_string + '\n'
    # Third line
    formatted_calendar += separator_line + '\n'
    # Fourth line (number of possible blank fields at the beginning)
    blank_fields = (8 - sun_starts_from) % 7  # 1=>0, 2=>6, 3=>5, 4=>4, 5=>3, 6=>2, 7=>1
    blank_string = (('|' + line_spaces) + (3 * ' ') + line_spaces) * blank_fields
    date_string = ''
    i = blank_fields + 1
    day = 1
    while day <= no_of_days:
        date_string += '|' + line_spaces + '%-3s' % (day) + line_spaces
        if i % 7 == 0:
            date_string += '|\n'
        i += 1
        day += 1
    # Number of blank fields in the last line
    # (this is 7, i.e. a full empty row, when the month ends on a Saturday)
    last_blank_fields = 7 - ((no_of_days - (7 - blank_fields)) % 7)
    last_blank_string = ('|' + line_spaces + 3 * ' ' + line_spaces) * last_blank_fields + '|'
    formatted_calendar += (blank_string + date_string) + last_blank_string + '\n'
    formatted_calendar += separator_line + '\n'
    return formatted_calendar

# Starts here
if __name__ == "__main__":
    try:
        no_of_days = int(input('Enter number of days of month(>=28 & <=31) : ').strip())
        sun_starts_from = int(input('Sunday starts from which date(>=1 & <=7) : ').strip())
        # First call
        formatted_calendar = show_formatted_calendar(no_of_days, sun_starts_from)
        print(formatted_calendar)
        # Second call (static input)
        print("\nFor Days 31 days where sunday starts from 4:-\n")
        formatted_calendar = show_formatted_calendar(31, 4, 2, '*', '.')
        print(formatted_calendar)
        # Third call (static input)
        print("\nFor Days 29 days where sunday starts from 2:-\n")
        formatted_calendar = show_formatted_calendar(29, 2, 3, '~')
        print(formatted_calendar)
    except Exception as error:
        print('Error occurred. ', error)
Output »
H:\RishikeshAgrawani\Projects\Sof\MonthTableGen>python MonthTableGen.py
Enter number of days of month(>=28 & <=31) : 31
Sunday starts from which date(>=1 & <=7) : 6
+-----+-----+-----+-----+-----+-----+-----+
| SUN | MON | TUE | WED | THU | FRI | SAT |
+-----+-----+-----+-----+-----+-----+-----+
| | | 1 | 2 | 3 | 4 | 5 |
| 6 | 7 | 8 | 9 | 10 | 11 | 12 |
| 13 | 14 | 15 | 16 | 17 | 18 | 19 |
| 20 | 21 | 22 | 23 | 24 | 25 | 26 |
| 27 | 28 | 29 | 30 | 31 | | |
+-----+-----+-----+-----+-----+-----+-----+
For Days 31 days where sunday starts from 4:-
.*******.*******.*******.*******.*******.*******.*******.
| SUN | MON | TUE | WED | THU | FRI | SAT |
.*******.*******.*******.*******.*******.*******.*******.
| | | | | 1 | 2 | 3 |
| 4 | 5 | 6 | 7 | 8 | 9 | 10 |
| 11 | 12 | 13 | 14 | 15 | 16 | 17 |
| 18 | 19 | 20 | 21 | 22 | 23 | 24 |
| 25 | 26 | 27 | 28 | 29 | 30 | 31 |
| | | | | | | |
.*******.*******.*******.*******.*******.*******.*******.
For Days 29 days where sunday starts from 2:-
+~~~~~~~~~+~~~~~~~~~+~~~~~~~~~+~~~~~~~~~+~~~~~~~~~+~~~~~~~~~+~~~~~~~~~+
| SUN | MON | TUE | WED | THU | FRI | SAT |
+~~~~~~~~~+~~~~~~~~~+~~~~~~~~~+~~~~~~~~~+~~~~~~~~~+~~~~~~~~~+~~~~~~~~~+
| | | | | | | 1 |
| 2 | 3 | 4 | 5 | 6 | 7 | 8 |
| 9 | 10 | 11 | 12 | 13 | 14 | 15 |
| 16 | 17 | 18 | 19 | 20 | 21 | 22 |
| 23 | 24 | 25 | 26 | 27 | 28 | 29 |
| | | | | | | |
+~~~~~~~~~+~~~~~~~~~+~~~~~~~~~+~~~~~~~~~+~~~~~~~~~+~~~~~~~~~+~~~~~~~~~+
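
For comparison, here is a compact sketch of the same idea that separates the layout from the printing and uses only lists and loops. The function names `month_rows` and `format_month` are hypothetical. Day 1 sits in column (8 - first_sunday) % 7, counting from Sunday as column 0, and the last week is padded only when it is incomplete, so no empty trailing row is produced.

```python
def month_rows(day_count, first_sunday):
    """Return the weeks as lists of 7 cells (day number or None), Sunday first."""
    offset = (8 - first_sunday) % 7            # empty cells before day 1
    cells = [None] * offset + list(range(1, day_count + 1))
    cells += [None] * (-len(cells) % 7)        # pad the last week if incomplete
    return [cells[i:i + 7] for i in range(0, len(cells), 7)]

def format_month(day_count, first_sunday):
    """Build the calendar as one string with a two-character column per day."""
    lines = ["Su Mo Tu We Th Fr Sa"]
    for week in month_rows(day_count, first_sunday):
        cells = ["  " if d is None else "%2d" % d for d in week]
        lines.append(" ".join(cells).rstrip())
    return "\n".join(lines)

print(format_month(30, 6))
```

Keeping `month_rows` separate makes the layout easy to check on its own before worrying about spacing.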
