Existing Df :
Id dates
01 ['ATIVE 04/2018 to 03/2020',' XYZ mar 2020 – Jul 2022','June 2021 - 2023 XYZ']
Expected Df :
Id dates
01 ['04/2018 to 03/2020','mar 2020 – Jul 2022','June 2021 - 2023']
I am looking to clean the List under the dates column. i tried it with below function but doesn't serve the purpose. Any leads on the same..?
def clean_dates_list(dates_list):
cleaned_dates_list = []
for date_str in dates_list:
cleaned_date_str = re.sub(r'[^A-Za-z\s\d]+', '', date_str)
cleaned_dates_list.append(cleaned_date_str)
return cleaned_dates_list
ls = ['ATIVE 04/2018 to 03/2020', ' XYZ mar 2020 – Jul 2022', 'June 2021 - 2023 XYZ']
ls_to_remove = ['ATIVE', 'XYZ']
for item in ls:
ls_str = item.split()
new_item = str()
for item in ls_str:
if item in ls_to_remove:
continue
new_item += " " + item
print(new_item)
I don't know your list of words to remove and it's not a good practice. But in your case it works.
i have table where it has dates in multiple formats. with that it also has some unwanted text which i want to drop so that i could process this date strings
Data :
sr.no. col_1 col_2
1 'xper may 2022 - nov 2022' 'derh 06/2022 - 07/2022 ubj'
2 'sp# 2021 - 2022' 'zpt May 2022 - December 2022'
Expected Output :
sr.no. col_1 col_2
1 'may 2022 - nov 2022' '06/2022 - 07/2022'
2 '2021 - 2022' 'May 2022 - December 2022'
def keep_valid_characters(string):
return re.sub(r'(?i)\b(jan(uary)?|feb(ruary)?|mar(ch)?|apr(il)?|may|jun(e)?|jul(y)?|aug(ust)?|sep(tember)?|oct(ober)?|nov(ember)?|dec(ember)?)\b|[^a-z0-9/-]', '', string)
i am using the above pattern to drop but stuck. any other approach.?
You can try to split the pattern construction to multiple strings in complicated case like this:
months = r"jan(?:uary)?|feb(?:ruary)?|mar(?:ch)?|apr(?:il)?|may|june?|july?|aug(?:ust)?|sep(?:tember)?|oct(?:ober)?|nov(?:ember)?|dec(?:ember)?"
pat = rf"(?i)((?:{months})?\s*[\d/]+\s*-\s*(?:{months})?\s*[\d/]+)"
df[["col_1", "col_2"]] = df[["col_1", "col_2"]].transform(lambda x: x.str.extract(pat)[0])
print(df)
Prints:
sr.no. col_1 col_2
0 1 may 2022 - nov 2022 06/2022 - 07/2022
1 2 2021 - 2022 May 2022 - December 2022
I have a dataframe structured as follows:
Name Month Grade
Sue Jan D
Sue Feb D
Jason Mar B
Sue Mar D
Jason Jan B
Sue Apr A
Jason Feb C
I want to get the list of students who got D 3 consecutive months in the past 6 months. In the example above, Sue will be on the list since she got D in Jan, Feb ad March. How can I do that using Python or Pandas or Numpy?
I tried to solve your problem. I do have a solution for you but it may not be the fastest in terms of efficiency / code execution. Please see below:
newdf = df.pivot(index='Name', columns='Month', values='Grade')
newdf = newdf[['Jan', 'Feb', 'Mar', 'Apr']].fillna(-1)
newdf['concatenated'] = newdf['Jan'].astype('str') + newdf['Feb'].astype('str') + newdf['Mar'].astype('str') + newdf['Apr'].astype('str')
newdf[newdf['concatenated'].str.contains('DDD', regex=False, na=False)]
Output will be like:
Month Jan Feb Mar Apr concatenated
Name
Sue D D D A DDDA
If you just want the names, then the following command instead.
newdf[newdf['concatenated'].str.contains('DDD', regex=False, na=False)].index.to_list()
I came up with this.
df['Month_Nr'] = pd.to_datetime(df.Month, format='%b').dt.month
names = df.Name.unique()
students = np.array([])
for name in names:
filter = df[(df.Name==name) & (df.Grade=='D')].sort_values('Month_Nr')
if filter['Month_Nr'].diff().cumsum().max() >= 2:
students = np.append(students, name)
print(students)
Output:
['Sue']
you have a few ways to deal with this, first use my previous solution but this will require mapping academic numbers to months (i.e September = 1, August = 12) that way you can apply math to work out consecutive values.
the following is to covert the Month into a DateTime and work out the difference in months, we can then apply a cumulative sum and filter any values greater than 3.
d = StringIO("""Name Month Grade
Sue Jan D
Sue Feb D
Jason Mar B
Sue Dec D
Jason Jan B
Sue Apr A
Jason Feb C""")
df = pd.read_csv(d,sep='\s+')
df['date'] = pd.to_datetime(df['Month'],format='%b').dt.normalize()
# set any values greater than June to the previous year.
df['date'] = np.where(df['date'].dt.month > 6,
(df['date'] - pd.DateOffset(years=1)),df['date'])
df.sort_values(['Name','date'],inplace=True)
def month_diff(date):
cumlative_months = (
np.round(((date.sub(date.shift(1)) / np.timedelta64(1, "M")))).eq(1).cumsum()
) + 1
return cumlative_months
df['count'] = df.groupby(["Name", "Grade"])["date"].apply(month_diff)
print(df.drop('date',axis=1))
Name Month Grade count
4 Jason Jan B 1
6 Jason Feb C 1
2 Jason Mar B 1
3 Sue Dec D 1
0 Sue Jan D 2
1 Sue Feb D 3
5 Sue Apr A 1
print(df.loc[df['Name'] == 'Sue'])
Name Month Grade date count
3 Sue Dec D 1899-12-01 1
0 Sue Jan D 1900-01-01 2
1 Sue Feb D 1900-02-01 3
5 Sue Apr A 1900-04-01 1
Closed. This question needs to be more focused. It is not currently accepting answers.
Want to improve this question? Update the question so it focuses on one problem only by editing this post.
Closed 3 years ago.
Improve this question
I have a column with 4 values like below in a dataframe :
Have attached the image below for better understanding
Input
India,Chennai - 24 Oct 1992
India,-Chennai, Oct 1992
(Asia) India,Chennai-22 Oct 1992
India,-Chennai, 1992
Output
Place
India Chennai
India Chennai
(Asia) India Chennai
India Chennai
Date
24 Oct 1992
Oct 1992
22 Oct 1992
1992
I need to split the Date and Year(23 Oct 1992, 1992) separately as a column and the text (India,Chennai) as separate column.
I'm bit confused to extract the values, I tried the replace and split options but couldn't achieve the result.
Would appreciate if somebody could help !!
Apologies for the format of Input and Output data !!
Use:
import re
df['Date'] = df['col'].str.split("(-|,)").str[-1]
df['Place'] = df.apply(lambda x: x['col'].split(x['Date']), axis=1).str[0].str.replace(',', ' ').str.replace('-', '')
Input
col
0 India,Chennai - 24 Oct 1992
1 India,-Chennai,Oct 1992
2 India,-Chennai, 1992
3 (Asia) India,Chennai-22 Oct 1992
Output
col Place Date
0 India,Chennai - 24 Oct 1992 India Chennai 24 Oct 1992
1 India,-Chennai,Oct 1992 India Chennai Oct 1992
2 India,-Chennai, 1992 India Chennai 1992
3 (Asia) India,Chennai-22 Oct 1992 (Asia) India Chennai 22 Oct 1992
There are lot of ways to create columns by using Pandas library in python,
you can create by creating list or by list of dictionaries or by dictionaries of list.
for simple understanding here i am going to use lists
first import pandas as pd
import pandas as pd
creating a list from given data
data = [['India','chennai', '24 Oct', 1992], ['India','chennai', '23 Oct', 1992],\
['India','chennai', '23 Oct', 1992],['India','chennai', '21 Oct', 1992]]
creating dataframe from list
df = pd.DataFrame(data, columns = ['Country', 'City', 'Date','Year'], index=(0,1,2,3))
print
print(df)
output will be as
Country City Date Year
0 India chennai 24 Oct 1992
1 India chennai 23 Oct 1992
2 India chennai 23 Oct 1992
3 India chennai 21 Oct 1992
hope this will help you
The following assumes that the first digit is where we always want to split the text. If the assumption fails then the code also fails!
>>> import re
>>> text_array
['India,Chennai - 24 Oct 1992', 'India,-Chennai,23 Oct 1992', '(Asia) India,Chennai-22 Oct 1992', 'India,-Chennai, 1992']
# split at the first digit, keep the digit, split at only the first digit
>>> tmp = [re.split("([0-9]){1}", t, maxsplit=1) for t in text_array]
>>> tmp
[['India,Chennai - ', '2', '4 Oct 1992'], ['India,-Chennai,', '2', '3 Oct 1992'], ['(Asia) India,Chennai-', '2', '2 Oct 1992'], ['India,-Chennai, ', '1', '992']]
# join the last two fields together to get the digit back.
>>> r = [(i[0], "".join(i[1:])) for i in tmp]
>>> r
[('India,Chennai - ', '24 Oct 1992'), ('India,-Chennai,', '23 Oct 1992'), ('(Asia) India,Chennai-', '22 Oct 1992'), ('India,-Chennai, ', '1992')]
If you have control over the how input is generated then I would suggest that
the input is made more consistent and then we can parse using a tool like
pandas or directly with csv.
Hope this helps.
Regards,
Prasanth
Python code:
import re
import pandas as pd
input_dir = '/content/drive/My Drive/TestData'
csv_file = '{}/test001.csv'.format(input_dir)
p = re.compile(r'(?:[0-9]|[0-2][0-9]|[3][0-1])\s(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)\s(?:\d{4})|(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)\s(?:\d{4})|(?:\d{4})', re.IGNORECASE)
places = []
dates = []
with open(csv_file, encoding='utf-8', errors='ignore') as f:
for line in f:
s = re.sub("[,-]", " ", line.strip())
s = re.sub("\s+", " ", s)
r = p.search(s)
str_date = r.group()
dates.append(str_date)
place = s[0:s.find(str_date)]
places.append(place)
dict = {'Place': places,
'Date': dates
}
df = pd.DataFrame(dict)
print(df)
Output:
Place Date
0 India Chennai 24 Oct 1992
1 India Chennai Oct 1992
2 (Asia) India Chennai 22 Oct 1992
3 India Chennai 1992
You can use this to create the dataframe:
xyz = pd.DataFrame({'release' : ['7 June 2013', '2012', '31 January 2013',
'February 2008', '17 June 2014', '2013']})
I am trying to split the data and save, them into 3 columns named "day, month and year", using this command:
dataframe[['day','month','year']] = dataframe['release'].str.rsplit(expand=True)
The resulting dataframe is :
dataframe
As you can see, that it works perfectly when it gets 3 strings, but whenever it is getting less then 3 strings, it saves the data at the wrong place.
I have tried split and rsplit, both are giving the same result.
Any solution to get the data at the right place?
The last one is year and it is present in every condition , it should be the first one to be saved and then month if it is present otherwise nothing and same way the day should be stored.
You could
In [17]: dataframe[['year', 'month', 'day']] = dataframe['release'].apply(
lambda x: pd.Series(x.split()[::-1]))
In [18]: dataframe
Out[18]:
release year month day
0 7 June 2013 2013 June 7
1 2012 2012 NaN NaN
2 31 January 2013 2013 January 31
3 February 2008 2008 February NaN
4 17 June 2014 2014 June 17
5 2013 2013 NaN NaN
Try reversing the result.
dataframe[['year','month','day']] = dataframe['release'].str.rsplit(expand=True).reverse()