How to isolate part of string in pandas dataframe - python

I have a dataframe containing a column of strings. From each string I want to extract the year part, create a new column, and assign the extracted value to it. My problem is isolating that last part of the string. An example string is 'TON GFR 2018 N'. For this string, either of the following works (I want to isolate 18, not 2018):
new_data['Year'] = pd.DataFrame([str(ele[1])[:2] for ele in list(new_data['Name'].str.split('20'))])
new_data['Year'] = new_data['Name'].str.split('20').str[1]
new_data['Year'] = new_data['Year'].str[:2]
However, I also come across names like 'TON RO20 2018 N' or 'TON 2020 N', and then it does not work. The number of spaces also differs between rows in the dataframe, so counting spaces in the string does not work either.
Any smart solutions to my problem?

Use .str.extract() to extract a 4-digit string starting with 20 and keep its last 2 digits, as follows:
new_data['Year'] = new_data['Name'].str.extract(r'20(\d\d)')
If you want to ensure the 4-digit string is not part of a longer string/number, you can enclose the pattern in the regex metacharacter \b (word boundary), as follows:
new_data['Year'] = new_data['Name'].str.extract(r'\b20(\d\d)\b')
Demo
Input data:
print(new_data)
Name
0 TON GFR 2018 N
1 TON RO20 2018 N
2 TON 2020 N
Result:
print(new_data)
Name Year
0 TON GFR 2018 N 18
1 TON RO20 2018 N 18
2 TON 2020 N 20

If the year is always the same distance from the end of the string, you could use:
new_data["Year"] = new_data["Name"].str.slice(start=-4, stop=-2)
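For reference, here is a minimal runnable check of both answers on the sample names from the question:

```python
import pandas as pd

# Sample names taken from the question
new_data = pd.DataFrame({'Name': ['TON GFR 2018 N', 'TON RO20 2018 N', 'TON 2020 N']})

# Regex approach: a standalone 4-digit number starting with "20";
# the word boundaries stop it from matching the "20" inside "RO20"
by_regex = new_data['Name'].str.extract(r'\b20(\d\d)\b')[0]

# Positional approach: the two characters sitting four from the end
by_slice = new_data['Name'].str.slice(start=-4, stop=-2)

print(by_regex.tolist())  # ['18', '18', '20']
print(by_slice.tolist())  # ['18', '18', '20']
```

Both agree on these samples, but the regex version also survives trailing suffixes of varying length.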

Related

Creating new dataframes by selecting rows with numbers/digits

I have this small dataframe:
index words
0 home # there is a blank in words
1 zone developer zone
2 zero zero
3 z3 z3
4 ytd2525 ytd2525
... ... ...
3887 18TH 18th
3888 180m 180m deal
3889 16th 16th
3890 150M 150m monthly
3891 10am 10am 20200716
I would like to select all the rows whose index contains numbers, in order to create a dataframe with only them, and another dataframe where both index and words contain numbers.
To select rows which contain numbers I have considered the following:
m1 = df['index'].apply(lambda x: not any(i.isnumeric() for i in x.split()))
m2 = df['index'].str.isalpha()
m3 = df['index'].apply(lambda x: not any(i.isdigit() for i in x))
m4 = ~df['index'].str.contains(r'[0-9]')
I do not know which one should be preferred (as they are redundant). But I would also consider another case, where both index and words contain numbers (digits), in order to select rows and create two dataframes.
Your question is not clear; happy to correct this if I got it wrong.
To get all index values containing numbers into their own dataframe, please try:
df.loc[df['index'].str.contains(r'\d+'), 'index'].to_frame()
and for rows where both index and words contain numbers:
df.loc[df['index'].str.contains(r'\d+') & df['words'].str.contains(r'\d+'), :]
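A minimal sketch of building both dataframes from digit masks, on a small subset of the question's data (the row ('b2', 'beta') is invented here just to show the two selections can differ):

```python
import pandas as pd

# Subset of the question's data plus one hypothetical row ('b2', 'beta')
df = pd.DataFrame({'index': ['home', 'zone', 'z3', 'b2', '18TH'],
                   'words': ['', 'developer zone', 'z3', 'beta', '18th']})

digit_in_index = df['index'].str.contains(r'\d')          # digits in index
digit_in_both = digit_in_index & df['words'].str.contains(r'\d')

df_index_only = df.loc[digit_in_index]   # first requested dataframe
df_both = df.loc[digit_in_both]          # second requested dataframe

print(df_index_only['index'].tolist())  # ['z3', 'b2', '18TH']
print(df_both['index'].tolist())        # ['z3', '18TH']
```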

Problem with the date format in a column of my DataFrame

So I have a column which contains dates as string objects, however the dates are not all in the same format. Some are MM/YYYY or YYYY. I would like them to be all YYYY, and then convert them to floating objects. I am trying to use a regular expression to replace these strings but I am having difficulty. The column name is 'cease_date' and the DF is called 'dete_resignations'.
pattern2 = r"(?P<cease_date>[1-2][0-9]{3})?"
years = dete_resignations['cease_date'].str.extractall(pattern2)
print(years['cease_date'].value_counts())
2013 146
2012 129
2014 22
2010 2
2006 1
So from the above the regular expression works, but I have no idea how to get it back into the original dataframe. I tried doing a boolean index but it didn't work. Am I going about this the wrong way?
You can use this regex to extract the last four digits in your strings:
years = dete_resignations['cease_date'].str.extract(r'(\d{4})$')[0]
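To get the extracted years back into the original dataframe and convert them to floats, plain assignment works because .str.extract() preserves the original index. A minimal sketch with hypothetical sample values (the column name cease_year is an assumption):

```python
import pandas as pd

# Hypothetical sample mirroring the mixed MM/YYYY and YYYY formats
dete_resignations = pd.DataFrame({'cease_date': ['05/2013', '2012', '09/2014', '2010']})

# Extract the trailing 4-digit year and convert to float; index
# alignment puts the result back into the original DataFrame
dete_resignations['cease_year'] = (
    dete_resignations['cease_date'].str.extract(r'(\d{4})$')[0].astype(float)
)
print(dete_resignations['cease_year'].tolist())  # [2013.0, 2012.0, 2014.0, 2010.0]
```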

Get rid of initial spaces at specific cells in Pandas

I am working with a big dataset (more than 2 million rows x 10 columns) that has a column with string values that were filled oddly. Some rows start and end with many space characters, while others don't.
What I have looks like this:
col1
0 (spaces)string(spaces)
1 (spaces)string(spaces)
2 string
3 string
4 (spaces)string(spaces)
I want to get rid of those spaces at the beginning and at the end and get something like this:
col1
0 string
1 string
2 string
3 string
4 string
Normally, for a small dataset I would use a for iteration (I know it's far from optimal) but now it's not an option given the time it would take.
How can I use the power of pandas to avoid a for loop here?
Thanks!
Edit: I can't get rid of all the whitespace, since the strings themselves contain spaces.
df['col1'] = df['col1'].apply(lambda x: x.strip())
might help
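Alternatively, pandas has a vectorized string method for exactly this, which avoids the Python-level lambda; a minimal sketch:

```python
import pandas as pd

# Hypothetical stand-in for the oddly padded column
df = pd.DataFrame({'col1': ['   some string  ', ' some string ', 'some string']})

# .str.strip() removes only leading/trailing whitespace, keeping
# interior spaces intact, and is vectorized over the whole column
df['col1'] = df['col1'].str.strip()
print(df['col1'].tolist())  # ['some string', 'some string', 'some string']
```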

Adding the quantities of products in a dataframe column in Python

I'm trying to calculate the sum of weights in a column of an excel sheet that contains the product title with the help of Numpy/Pandas. I've already managed to load the sheet into a dataframe, and isolate the rows that contain the particular product that I'm looking for:
dframe = xlsfile.parse('Sheet1')
dfFent = dframe[dframe['Product:'].str.contains("ABC") == True]
But, I can't seem to find a way to sum up its weights, due to the obvious complexity of the problem (as shown below). For eg. if the column 'Product Title' contains values like -
1 gm ABC
98% pure 12 grams ABC
0.25 kg ABC Powder
ABC 5gr
where ABC is the product whose weight I'm looking to add up. Is there any way that I can add these weights all up to get a total of 268 gm? Any help or resources pointing to the solution would be highly appreciated. Thanks! :)
You can use extractall for values with units or percentage:
(?P<a>\d+\.\d+|\d+) means extract float or int to column a
\s* - is zero or more spaces between number and unit
(?P<b>[a-z%]+) is extract lowercase unit or percentage after number to b
#add all possible units to a dictionary
d = {'gm':1,'gr':1,'grams':1,'kg':1000,'%':.01}
df1 = df['Product:'].str.extractall(r'(?P<a>\d+\.\d+|\d+)\s*(?P<b>[a-z%]+)')
print (df1)
a b
match
0 0 1 gm
1 0 98 %
1 12 grams
2 0 0.25 kg
3 0 5 gr
Then convert the first column to numeric and map the second by the dictionary of units. Then reshape by unstack, multiply the columns with prod, and finally sum:
a = df1['a'].astype(float).mul(df1['b'].map(d)).unstack().prod(axis=1).sum()
print (a)
267.76
Similar solution:
a = df1['a'].astype(float).mul(df1['b'].map(d)).groupby(level=0).prod().sum()
You need to do some data wrangling to get the column into one consistent format, similar to date-time formatting. You could align the Product column along these lines:
Make a separate column holding only the numeric values (floats)
Change % values to decimals and multiply by the quantity
Convert values given in kg to grams
Sum the resulting float-only column
Pandas works well for this kind of problem.
Note: There is no shortcut here; you need to separate the numeric values from the surrounding strings before calculating the sum.
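The steps above can be sketched roughly as follows, using the sample titles from the question (in recent pandas, per-group products are written with groupby rather than prod(level=0)):

```python
import pandas as pd

# Sample product titles from the question
titles = pd.Series(['1 gm ABC', '98% pure 12 grams ABC', '0.25 kg ABC Powder', 'ABC 5gr'])

# Factor converting each unit to grams; percentages scale the
# quantity they precede (98% pure 12 grams -> 0.98 * 12)
factors = {'gm': 1, 'gr': 1, 'grams': 1, 'kg': 1000, '%': 0.01}

# Extract every number/unit pair per title
parts = titles.str.extractall(r'(?P<a>\d+\.\d+|\d+)\s*(?P<b>[a-z%]+)')

# Convert to grams, multiply the pairs within each title, then total
grams_per_title = (parts['a'].astype(float)
                             .mul(parts['b'].map(factors))
                             .groupby(level=0).prod())
total = grams_per_title.sum()
print(total)  # ≈ 267.76
```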

how to delete whitespace from nd array python

I have yearly data sets with some missing data. I used the code below to read them, but I am unable to omit the whitespace present at the end of February. Can anyone help solve this problem?
df1 = pd.read_fwf('DQ404.7_77.txt',widths=ws,header=9, nrows=31, keep_default_na = False)
df1 = df1.drop('Day', axis=1)
df2 = np.array(df1).T
What I want is to arrange all the data in one column with respect to date. My data is uploaded at this link, where you can download it:
https://drive.google.com/open?id=0B2rkXkOkG7ExbEVwZUpHR29LNFE
What I want is to get time-series data from this file, and it should look like:
Feb,25 13
Feb,26 13
Feb,27 13
Feb,28 13
March, 1 10
March, 2 10
March, 3 10
Not with empty strings in between February and March.
So after a lot of comments it looks like df[df != ''] works for you
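Assuming the parsed frame ends up with months as columns and days as rows (the tiny frame below is a hypothetical stand-in, since the linked file isn't reproduced here), melting and filtering out the empty strings yields one long column:

```python
import pandas as pd

# Hypothetical stand-in: February padded with an empty string
df1 = pd.DataFrame({'Feb': ['13', '13', ''],
                    'Mar': ['10', '10', '10']})

# Stack month columns into one long column, drop the blank cells
long = df1.melt(var_name='month', value_name='value')
long = long[long['value'] != '']

print(long['value'].tolist())  # ['13', '13', '10', '10', '10']
```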
