Convert to DateType column - python

I have a column with the values below. How can I add another column with values converted to DateType?

As the front of the string is fixed and the middle of the string is comma-separated, you could use a mix of substr and split to get what you want. Finally use make_date to create the date from the component parts.
A simple example:
import pyspark.sql.functions as F

df2 = df \
    .withColumn("xyear", F.split(F.col("col1").substr(19, 12), ",").getItem(0)) \
    .withColumn("xmonth", F.split(F.col("col1").substr(19, 12), ",").getItem(1)) \
    .withColumn("xday", F.split(F.col("col1").substr(19, 12), ",").getItem(2)) \
    .withColumn("md2", F.expr("make_date(xyear, xmonth, xday)"))
df2.show()
My results:
You could also look at RegEx to split the string. Some good examples here. I'd be interested to see if there was a more Pythonic way of doing it.
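For illustration, here is one regex-based sketch using regexp_extract (this assumes col1 contains the date somewhere as comma-separated digits like 2021,7,14 — adjust the pattern to your actual format):
import pyspark.sql.functions as F

# capture year, month and day as three groups from a "yyyy,m,d" run of digits
pattern = r"(\d{4}),(\d{1,2}),(\d{1,2})"
df3 = df \
    .withColumn("xyear", F.regexp_extract("col1", pattern, 1)) \
    .withColumn("xmonth", F.regexp_extract("col1", pattern, 2)) \
    .withColumn("xday", F.regexp_extract("col1", pattern, 3)) \
    .withColumn("md2", F.expr("make_date(xyear, xmonth, xday)"))
df3.show()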

Related

How to use a Pandas column inside a regex in .str.extract() function

I have a dataframe with two columns. I need to use the content of the first column to search the second column and capture part of it with a regex.
column1    column2
key1       word1/word2/key1
key2       word3/word4/word5/key2
key3       word6/key3/word7
I need to search "key1" in word1/word2/key1 and capture the string between the two "/" before "key1". In this example, on the first row I need to capture "word2", on the second row I need to capture "word5", and on the third row I need to capture "word6".
I don't know how to pass "column1" as a variable inside regex in .str.extract(r'/(\w{1,})/"column1"').
Can anyone help me please?
Building off Tim Roberts' comment, you can write a function that acts on a row of your data frame. I suggest avoiding a regular expression if you can. Assuming that column 2 always contains words delimited by a forward slash (/), you could do something like this:
import pandas as pd
import re

df = pd.DataFrame({
    "column1": ["key1", "key2"],
    "column2": ["word/word2/key1", "word3/word4/key2"]
})

def extract_preceding_word(row):
    query, string = row
    parts = string.split("/")
    idx = parts.index(query)
    return parts[idx - 1]

new_df = df.assign(
    new_column2=lambda DF: DF.apply(extract_preceding_word, axis=1)
)
print(new_df)
The result is
  column1           column2 new_column2
0    key1   word/word2/key1       word2
1    key2  word3/word4/key2       word4
A few things to point out here.
If the first column might contain characters that have special meaning for a regular expression, you will need to take care to escape those when generating the regex.
For example, could column1 have a value like key1(foo)?
I would suggest avoiding having the data in a pandas.DataFrame, if you can. This problem might be easier to solve with two lists instead of two columns.
If you really need a regular expression, I recommend using regex101 to develop your pattern. To get you started, a lookahead assertion may be what you need. For example
def extract_preceding_word_regex(row):
    query, string = row
    # lookahead: capture the word that is immediately followed by "/<query>"
    pat = rf"([^/]+)(?=/{query})"
    match = re.search(pat, string)
    return match.group(1) if match else None
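If column1 can contain regex metacharacters (the key1(foo) case mentioned above), a hedged variant of the same idea escapes the key before building the pattern; extract_preceding_word_safe is just an illustrative name:
def extract_preceding_word_safe(row):
    query, string = row
    # re.escape guards against metacharacters in the key, e.g. "key1(foo)"
    pat = rf"([^/]+)(?=/{re.escape(query)})"
    match = re.search(pat, string)
    return match.group(1) if match else None

print(df.apply(extract_preceding_word_safe, axis=1))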

Extract substring from left to a specific character for each row in a pandas dataframe?

I have a dataframe that contains a collection of strings. These strings look something like this:
"oop9-hg78-op67_457y"
I need to cut everything from the underscore to the end in order to match this data with another set. My attempt looked something like this:
df['column'] = df['column'].str[0:'_']
I've tried toying around with .find() in this statement but nothing seems to work. Anybody have any ideas? Any and all help would be greatly appreciated!
You can try .str.split and access the resulting list with .str, or use .str.extract:
df['column'] = df['column'].str.split('_').str[0]
# or
df['column'] = df['column'].str.extract('^([^_]*)_')
print(df)
           column
0  oop9-hg78-op67
df['column'] = df['column'].str.extract('(.*)_', expand=False)
could also be used if another option is needed.
Adding to the solution provided above by #Ynjxsjmh
You can use str.extract:
df['column2'] = df['column'].str.extract(r'(^[^_]+)')
Output (as separate column for clarity):
                column         column2
0  oop9-hg78-op67_457y  oop9-hg78-op67
Regex:
(      # start capturing group
^      # match start of string
[^_]+  # one or more non-underscore
)      # end capturing group
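For reference, a minimal runnable version of that pattern (using the sample string from the question; expand=False keeps the result as a Series so it can be assigned directly):
import pandas as pd

df = pd.DataFrame({'column': ['oop9-hg78-op67_457y']})
df['column2'] = df['column'].str.extract(r'(^[^_]+)', expand=False)
print(df)
#                 column         column2
# 0  oop9-hg78-op67_457y  oop9-hg78-op67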

How to delete junk strings appearing in an integer column

I have a column of integers (sample row: 123456789) and some of the values are interspersed with junk letters, e.g. 1234y5678. I want to delete the letters appearing in such cells and keep the digits. How do I go about it using Pandas?
Assume my dataframe is df and the column name is mobile.
Should I use np.where with conditions such as df[df['mobile'].str.contains('a-z')] and use string replace?
If your junk characters are not limited to letters, you should use this:
yourSeries.str.replace('[^0-9]', '', regex=True)
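For example, on a small Series with both letters and punctuation mixed in (toy data, just for illustration):
import pandas as pd

s = pd.Series(['1234y5678', '98-76#54'])
print(s.str.replace('[^0-9]', '', regex=True))
# 0    12345678
# 1      987654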
Use pd.Series.str.replace:
import pandas as pd
s = pd.Series(['125109a181', '1361q1j1', '85198m4'])
s.str.replace('[a-zA-Z]', '', regex=True).astype(int)
Output:
0    125109181
1       136111
2       851984
Use the regex character class \D (not a digit):
df['mobile'] = df['mobile'].str.replace(r'\D', '', regex=True).astype('int64')

How can I replace multiple characters from all columns of a spark dataframe?

I have a dataframe containing multiple columns.
>>> df.take(1)
[Row(A=u'{dt:dt=string, content=Prod}', B=u'{dt:dt=string, content=Staging}')]
I want to remove both curly braces '{' and '}' from values of column A and B of df. I know we can use:
df.withColumn('A',regexp_replace('A','//{',''))
df.withColumn('A',regexp_replace('A','//}',''))
df.withColumn('B',regexp_replace('B','//}',''))
How do I replace characters dynamically across all columns of a Spark DataFrame? (The Pandas version is shown below.)
df = df.replace({'{':'','}':''},regex=True)
Just use proper regular expression:
df.withColumn("A", regexp_replace("A", "[{}]", ""))
To modify a dataframe df and apply regexp_replace to multiple columns given by listOfColumns, you could use foldLeft (in Scala) like so:
val newDf = listOfColumns.foldLeft(df)((acc, x) => acc.withColumn(x, regexp_replace(col(x), ..., ...)))
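In PySpark, a sketch of the dynamic, all-columns version is a simple loop over df.columns (assuming every column is a string column):
from pyspark.sql import functions as F

# strip the curly braces from every column of the dataframe
for c in df.columns:
    df = df.withColumn(c, F.regexp_replace(F.col(c), "[{}]", ""))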

Trying to remove commas and dollar signs with Pandas in Python

Trying to remove the commas and dollar signs from the columns, but when I do, the table still prints them out with them in there. Is there a different way to remove the commas and dollar signs using a pandas function? I was unable to find anything in the API docs, or maybe I was looking in the wrong place.
import pandas as pd
import pandas_datareader.data as web
players = pd.read_html('http://www.usatoday.com/sports/mlb/salaries/2013/player/p/')
df1 = pd.DataFrame(players[0])
df1.drop(df1.columns[[0,3,4, 5, 6]], axis=1, inplace=True)
df1.columns = ['Player', 'Team', 'Avg_Annual']
df1['Avg_Annual'] = df1['Avg_Annual'].replace(',', '')
print (df1.head(10))
You have to access the str attribute per http://pandas.pydata.org/pandas-docs/stable/text.html
df1['Avg_Annual'] = df1['Avg_Annual'].str.replace(',', '')
df1['Avg_Annual'] = df1['Avg_Annual'].str.replace('$', '')
df1['Avg_Annual'] = df1['Avg_Annual'].astype(int)
Alternatively:
df1['Avg_Annual'] = df1['Avg_Annual'].str.replace(',', '').str.replace('$', '').astype(int)
if you want to prioritize time spent typing over readability.
Shamelessly stolen from this answer... but, that answer is only about changing one character and doesn't complete the coolness: since it takes a dictionary, you can replace any number of characters at once, as well as in any number of columns.
# if you want to operate on multiple columns, put them in a list like so:
cols = ['col1', 'col2', ..., 'colN']
# pass them to df.replace(), specifying each char and its replacement:
df[cols] = df[cols].replace({r'\$': '', ',': ''}, regex=True)
#shivsn caught that you need to use regex=True; you already knew about replace (but also didn't show trying to use it on multiple columns or both the dollar sign and comma simultaneously).
This answer is simply spelling out the details I found from others in one place for those like me (e.g. noobs to Python and pandas). Hope it's helpful.
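A toy run of that dictionary form, with a hypothetical second column just to show the multi-column case:
import pandas as pd

df = pd.DataFrame({
    'Avg_Annual': ['$1,234,567', '$890,123'],
    'Bonus': ['$10,000', '$2,500'],   # hypothetical column for illustration
})
cols = ['Avg_Annual', 'Bonus']
df[cols] = df[cols].replace({r'\$': '', ',': ''}, regex=True)
print(df.astype(int))
#    Avg_Annual  Bonus
# 0     1234567  10000
# 1      890123   2500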
#bernie's answer is spot on for your problem. Here's my take on the general problem of loading numerical data in pandas.
Often the source of the data is reports generated for direct consumption, hence the presence of extra formatting like %, thousands separators, currency symbols, etc. All of these are useful for reading but cause problems for the default parser. My solution is to typecast the column to string, replace these symbols one by one, then cast it back to the appropriate numerical format. Having a boilerplate function that retains only [0-9.] is tempting, but it causes problems where the thousands separator and decimal point get swapped, and also in the case of scientific notation. Here's my code, which I wrap into a function and apply as needed.
df[col] = df[col].astype(str)  # cast to string
# all the string surgery goes in here
df[col] = df[col].str.replace('$', '', regex=False)
df[col] = df[col].str.replace(',', '', regex=False)  # assuming ',' is the thousands separator in your locale
df[col] = df[col].str.replace('%', '', regex=False)
df[col] = df[col].astype(float)  # cast back to the appropriate numeric type
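Wrapped into a reusable function as described (a sketch; clean_numeric_column is just an illustrative name):
def clean_numeric_column(df, col):
    """Strip report formatting ($, thousands separators, %) and cast to float."""
    df[col] = (
        df[col].astype(str)
               .str.replace('$', '', regex=False)
               .str.replace(',', '', regex=False)
               .str.replace('%', '', regex=False)
               .astype(float)
    )
    return df

df1 = clean_numeric_column(df1, 'Avg_Annual')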
This worked for me. Adding "|" means or:
df['Salary'].str.replace(r'\$|,', '', regex=True)
I used this logic
df.col = df.col.apply(lambda x:x.replace('$','').replace(',',''))
When I got to this problem, this was how I got out of it.
df['Salary'] = df['Salary'].str.replace("$",'').astype(float)
