Let's say I have a dataframe with value strings that look like:
[26.07. - 08.09.]
and I want to add '2018' after the last '.' of each date, so that my output will be:
[26.07.2018 - 08.09.2018]
and apply this to the rest of the dataframe, which has the same format.
so far I have the code:
df.iloc[:,1].replace('.','2018',regex=True)
how can I change my code such that it will work as I desire?
I am doing this so that eventually I can transform these into dates and count how many days there are between the two dates.
a = '[26.07. - 08.09.]'
aWithYear = [i[:-1]+'2018'+i[-1] for i in a.split('-')]
print('-'.join(aWithYear))
# prints [26.07.2018 - 08.09.2018]
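The same helper can be applied to a whole dataframe column via .apply (a sketch with made-up data, assuming every value has exactly this '[dd.mm. - dd.mm.]' shape):

```python
import pandas as pd

def add_year(s):
    # split on '-', insert '2018' before the last character of each half
    return '-'.join(i[:-1] + '2018' + i[-1] for i in s.split('-'))

df = pd.DataFrame({'col': ['[26.07. - 08.09.]']})
df['col'] = df['col'].apply(add_year)
print(df['col'][0])  # [26.07.2018 - 08.09.2018]
```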
If you have, for example,
df = pd.DataFrame({'col': ['[05.07. - 18.08.]', '[05.07. - 18.09.]']})
col
0 [05.07. - 18.08.]
1 [05.07. - 18.09.]
You can split and concat the str.get(0) and str.get(1) values
vals = df.col.str.strip('[]').str.split("- ")
get = lambda s: vals.str.get(s).str.strip() + '2018'
df['col'] = '[' + get(0) + ' - ' + get(1) + ']'
col
0 [05.07.2018 - 18.08.2018]
1 [05.07.2018 - 18.09.2018]
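A single regex replacement is another option here; this sketch appends the year after every 'dd.mm.' group in one pass:

```python
import pandas as pd

df = pd.DataFrame({'col': ['[05.07. - 18.08.]', '[05.07. - 18.09.]']})
# \g<0> is the whole match, so '2018' is appended after each 'dd.mm.' group
df['col'] = df['col'].str.replace(r'(\d{2}\.\d{2}\.)', r'\g<0>2018', regex=True)
```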
I have the following code:
import datetime

def excel_date(date1):
    temp = datetime.datetime(1899, 12, 30)
    delta = date1 - temp if date1 != 0 else temp - temp
    return float(delta.days) + (float(delta.seconds) / 86400)
df3['SuperID'] = df3['Break_date'].apply(excel_date)
df3['SuperID2'] = df3['ticker'] + str(df3['SuperID'])
I pass a date in as date1 and get a number back from the excel_date function.
My ticker and SuperID fields are OK:
I want to concatenate both and get TSLA44462, but it concatenates the whole series if I use str() or .astype(str) on my SuperID column.
The column types:
Here is my solution, if I understood your problem:
import pandas as pd
df = pd.DataFrame({"Col1":[1.0,2.0,3.0,4.4], "Col2":["Michel", "Sardou", "Paul", "Jean"], "Other Col":[2,3,5,2]})
df["Concat column"] = df["Col1"].astype(int).astype(str) + df["Col2"]
df[df["Concat column"] == "1Michel"]
or
df = pd.DataFrame({"Col1":[1.0,2.0,3.0,4.4], "Col2":["Michel", "Sardou", "Paul", "Jean"], "Other Col":[2,3,5,2]})
df[(df["Col1"]==1) & (df["Col2"]=="Michel")]
After some hours of investigation, and with the help of the comments, this is the way to work with series, integers, floats and strings that worked for me:
def excel_date(date1):
    temp = datetime.datetime(1899, 12, 30)
    delta = date1 - temp if date1 != 0 else temp - temp
    return float(delta.days) + (float(delta.seconds) / 86400)
First of all, I convert float to integer to avoid decimals. int(x) does not work on a Series, so use .astype(int) instead, which works fine.
df3['SuperID'] = df3['Break_date'].apply(excel_date).astype(int)
After that, convert everything to a character array with np.char.array rather than str(x). You can then add the two columns and cast the result back with .astype(str) to get the desired result.
import numpy as np

a = np.char.array(df3['ticker'].values)
b = np.char.array(df3['SuperID'].values)
df3['SuperID2'] = (a + b).astype(str)
Hope this helps others working with series.
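For reference, the np.char.array step can usually be skipped: converting only the integer column with .astype(str) keeps the + element-wise (a minimal sketch with made-up ticker data):

```python
import pandas as pd

df3 = pd.DataFrame({'ticker': ['TSLA', 'AAPL'], 'SuperID': [44462, 44463]})
# .astype(str) converts each element, unlike str(series), which stringifies the whole Series
df3['SuperID2'] = df3['ticker'] + df3['SuperID'].astype(str)
```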
I have a pandas dataframe, where the 2nd, 3rd and 6th columns look like so:
start   end     strand
108286  108361  +
734546  734621  -
761233  761309  +
I'm trying to implement a conditional where, if strand is '+', the value in end becomes start + 1, and if strand is '-', the value in start becomes end - 1, so the output should look like this:
start   end     strand
108286  108287  +
734620  734621  -
761233  761234  +
And where the pseudocode may look like this:
if df["strand"] == "+":
    df["end"] = df["start"] + 1
else:
    df["start"] = df["end"] - 1
I imagine this might be best done with loc/iloc or numpy.where, but I can't seem to get it to work. As always, any help is appreciated!
You are correct, loc is the operator you are looking for
df.loc[df.strand=='+','end'] = df.loc[df.strand=='+','start']+1
df.loc[df.strand=='-','start'] = df.loc[df.strand=='-','end']-1
You could also use numpy.where:
import numpy as np
df[['start', 'end']] = np.where(df[['strand']]=='-', df[['end','end']]-[1,0], df[['start','start']]+[0,1])
Note that this assumes strand can have one of two values: + or -. If it can have any other values, we can use numpy.select instead.
Output:
start end strand
0 108286 108287 +
1 734620 734621 -
2 761233 761234 +
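The numpy.select variant mentioned above could be sketched like this (using the same dummy data; rows with any other strand value fall through to the default and stay unchanged):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'start': [108286, 734546, 761233],
                   'end': [108361, 734621, 761309],
                   'strand': ['+', '-', '+']})

# '-' rows: start becomes end - 1; everything else keeps its start
df['start'] = np.select([df['strand'] == '-'], [df['end'] - 1], default=df['start'])
# '+' rows: end becomes start + 1; everything else keeps its end
df['end'] = np.select([df['strand'] == '+'], [df['start'] + 1], default=df['end'])
```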
For a particular column (dtype = object), how can I add '-' to the start of the string, given that it ends with '-'?
i.e. convert 'MAY500-' to '-MAY500-'
(I need to add this to every element in the column)
Try something like this:
#setup
df = pd.DataFrame({'col':['aaaa','bbbb-','cc-','dddddddd-']})
mask = df.col.str.endswith('-')
df.loc[mask, 'col'] = '-' + df.loc[mask, 'col']
Output
df
col
0 aaaa
1 -bbbb-
2 -cc-
3 -dddddddd-
You can use np.select
Given a dataframe like this:
df
values
0 abcd-
1 a-bcd
2 efg-
You can use np.select as follows:
df['values'] = np.select([df['values'].str.endswith('-')], ['-' + df['values']], df['values'])
output:
df
values
0 -abcd-
1 a-bcd
2 -efg-
def add_prefix(text):
    # If text is null or an empty string, the -1 index would raise an IndexError
    if text and text[-1] == "-":
        return "-" + text
    return text
df = pd.DataFrame(data={'A':["MAY500", "MAY500-", "", None, np.nan]})
# Change the column to string dtype first
df['A'] = df['A'].astype(str)
df['A'] = df['A'].apply(add_prefix)
0 MAY500
1 -MAY500-
2
3 None
4 nan
Name: A, dtype: object
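If missing values should stay missing rather than become the string 'nan', a hedged alternative is to build the mask with na=False and update only the matching rows with .loc (a sketch on the same example data):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': ["MAY500", "MAY500-", "", None, np.nan]})
# na=False makes missing values count as "does not end with '-'"
ends_with_dash = df['A'].str.endswith('-', na=False)
df.loc[ends_with_dash, 'A'] = '-' + df.loc[ends_with_dash, 'A']
```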
I have a knack for using apply with lambda functions; it just makes the code a lot easier to read.
df['value'] = df['value'].apply(lambda x: '-'+str(x) if str(x).endswith('-') else x)
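The same condition can also be pushed into a single regex str.replace, anchored on a trailing '-' (a sketch with dummy data):

```python
import pandas as pd

df = pd.DataFrame({'value': ['aaaa', 'bbbb-', 'cc-']})
# prepend '-' only when the whole string ends with '-'
df['value'] = df['value'].str.replace(r'^(.*-)$', r'-\1', regex=True)
```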
Sometimes the string numbers in my DataFrames contain commas, representing either the decimal point or the thousands separator, and some contain none. The dataframe below is an example of the range of price formats I receive via an API; they vary depending on the currency. These are prices, and the decimals will always be 2. I need to convert the string prices to float so I can sum them, separate them into other dataframes, or use them for plotting graphs. I have created a loop to replace them, but is there a quicker way to do this without the loop?
My DataFrame and working loop is as follows:
data = {'amount': ['7,99', '6,99', '9.99', '-6,99', '1,000.00']}
df = pd.DataFrame(data)
fees = []
sales = []
for items in df['amount']:
    if items[-7:-6] == ',':
        items = float(items.replace(',', '').replace(' ', ''))
    elif items[-3:-2] == ',':
        items = float(items.replace(',', '.').replace(' ', ''))
    items = float(items)
    if items <= 0:
        fees.append(items)
    else:
        sales.append(items)
I have attempted to do this without the loop but can't seem to work out where I have gone wrong.
df["amount"] = np.where((df['amount'][-7:-6] == ','),
df["amount"][-7:-6].str.replace(',', '').replace(' ',''),
df["amount"])
df["amount"] = np.where((df['amount'][-3:-2] == ','),
df["amount"][-3:-2].str.replace(',', '').replace(' ',''),
df["amount"])
Any help would be much appreciated. Thank you in advance
Since you mention the last two digits are decimals, the ',' needs to be replaced with '.' to make the value a float. But you also have values like 1,000.00 that would become invalid if their ',' were replaced with '.', so you can use a regex to identify which values to replace:
data = {'amount': ['7,99', '6,99', '9.99', '-6,99', '1,000.00']}
df = pd.DataFrame(data)
df
First the regex matches any string ending in ',' followed by exactly two digits; the replace then substitutes the match with '.' plus the captured digits (99 from ,99):
df['amount'] = df['amount'].str.replace(r'(,)(\d{2}$)', r'.\2', regex=True)
# here r'.\2' is a literal '.' followed by the second captured group of the regex
Then to convert 1,000.00 to float we will replace the ',' with blank
df['amount'] = df['amount'].str.replace(',','')
And then convert the data type to float
df['amount'] = df['amount'].astype(float)
print(df)
    amount
0     7.99
1     6.99
2     9.99
3    -6.99
4  1000.00
You can use lambdas instead of numpy:
lambda1 = lambda items: float(str(items).replace(',', '').replace(' ','')) if str(items)[-7:-6] == ',' else items
lambda2 = lambda items: float(str(items).replace(',', '.').replace(' ','')) if str(items)[-3:-2] == ',' else items
to_float = lambda items: float(items)
df['amount_clean'] = df["amount"].map(lambda1).map(lambda2).map(to_float)
=========================================================================
Edit: what are lambdas
In python, lambda functions are small anonymous functions with a single expression (see https://www.w3schools.com/python/python_lambda.asp)
Example with condition:
lambda x: x + 1 if x < 0 else x
This is equivalent to:
def my_lambda_function(x):
    if x < 0:
        return x + 1
    else:
        return x
When passed to the column of a pandas dataframe via the map function, the lambda expression will be applied to the value in each row of the column.
Hope this helps!
Try using split and join,
df.amount.str.split(',').str.join('').astype(float)
Output
0 799.00
1 699.00
2 9.99
3 -699.00
4 1000.00
Name: amount, dtype: float64
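Once the amounts are floats, the fees/sales split from the original loop can also be done without iterating, using boolean indexing (a sketch, assuming the column has already been cleaned):

```python
import pandas as pd

df = pd.DataFrame({'amount': [7.99, 6.99, 9.99, -6.99, 1000.00]})
# non-positive amounts are fees, positive amounts are sales
fees = df.loc[df['amount'] <= 0, 'amount'].tolist()
sales = df.loc[df['amount'] > 0, 'amount'].tolist()
```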
I am trying to modify the formatting of the strings in a DataFrame column according to a condition.
Here is an example of the file
The DataFrame
Now, as you might see, the object column values either start with http or a capital letter: I want to make it so that:
if the string starts with http, I put it between <>
if the string starts with a capital letter, I wrap it in double quotes and append '#en'
However, I can't seem to manage it: I tried a simple if condition with .startswith('http') or contains('http'), but it doesn't work, because it returns a Series of booleans rather than a single condition.
Maybe it is very simple, but I cannot solve it; any help is appreciated.
Here is my code
import numpy as np
import pandas as pd
import re
ont1 = pd.read_csv('1.tsv',sep='\t',names=['subject','predicate','object'])
ont1['subject'] = '<' + ont1['subject'] + '>'
ont1['predicate'] = '<' + ont1['predicate'] + '>'
So it looks like you have many of the right pieces here. You mentioned boolean indexing, which is what you can use to select and update certain rows. For example, on a dummy DataFrame:
df = pd.DataFrame({"a":["http://akjsdhka", "Helloall", "http://asdffa", "Bignames", "nonetodohere"]})
First we build a mask of the rows starting with "http", then update just those rows:
mask = df["a"].str.startswith("http")
df.loc[mask, "a"] = "<" + df["a"] + ">"
And the same for the capital-letter condition:
mask2 = df["a"].str[0].str.isupper()
df.loc[mask2, "a"] = "\"" + df["a"] + "\"#en"
Final result:
a
0 <http://akjsdhka>
1 "Helloall"#en
2 <http://asdffa>
3 "Bignames"#en
4 nonetodohere
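Both conditions can also be folded into one np.select call (a sketch on the same dummy data; the http test comes first so URLs never hit the capital-letter branch):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": ["http://akjsdhka", "Helloall", "nonetodohere"]})
df["a"] = np.select(
    [df["a"].str.startswith("http"), df["a"].str[0].str.isupper()],
    ["<" + df["a"] + ">", '"' + df["a"] + '"#en'],
    default=df["a"],
)
```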
Try:
ont1.loc[ont1['subject'].str.startswith("http"), 'subject'] = "<" + ont1['subject'] + ">"
Ref to read:
https://www.shanelynn.ie/select-pandas-dataframe-rows-and-columns-using-iloc-loc-and-ix/