I have a DataFrame with a column that contains integers and sometimes a string of multiple comma-separated numbers (like "1234567, 89012345, 65425774").
I want to convert each such string to a list of integers so it's easier to search for specific numbers.
In [1]: import pandas as pd
In [2]: raw_input = ("1111111111 666 10069759 9695011 9536391,2261003 9312405 "
   ...:              "15542804 15956127 8409044 9663061 7104622 3273441 3336156 "
   ...:              "15542815 15434808 3486259 8469323 7124395 15956159 3319393 "
   ...:              "15956184 15956217 13035908 3299927")
In [3]: df = pd.DataFrame({'x':raw_input.split()})
In [4]: df.head()
Out[4]:
x
0 1111111111
1 666
2 10069759
3 9695011
4 9536391,2261003
Since your column mixes strings and integers, you probably want something like this:
def to_integers(column_value):
    if not isinstance(column_value, int):
        return [int(v) for v in column_value.split(',')]
    else:
        return column_value

df.loc[:, 'column_name'] = df.loc[:, 'column_name'].apply(to_integers)
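Once applied, searching for a specific number becomes a membership test per cell. A minimal sketch with hypothetical data (the real column name is your own):

```python
import pandas as pd

def to_integers(column_value):
    # ints pass through; comma-separated strings become lists of ints
    if not isinstance(column_value, int):
        return [int(v) for v in column_value.split(',')]
    else:
        return column_value

df = pd.DataFrame({'x': [10069759, '9536391,2261003']})
df['x'] = df['x'].apply(to_integers)

# rows whose cell contains 2261003, whether as a bare int or inside a list
hits = df['x'].apply(lambda v: v == 2261003 if isinstance(v, int) else 2261003 in v)
print(hits.tolist())  # [False, True]
```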
Your best solution to cases like this, where a column holds one or more values per cell, is splitting the data into multiple columns.
Try something like
ids = df.ID.str.split(',', expand=True)
for i in range(ids.shape[1]):
    df['ID' + str(i + 1)] = ids.iloc[:, i]
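For illustration, here is a self-contained sketch with hypothetical data holding up to three comma-separated IDs per cell; the loop produces one column per position:

```python
import pandas as pd

# hypothetical frame: up to three comma-separated IDs per cell
df = pd.DataFrame({'ID': ['1,2,3', '4,5', '6']})
ids = df['ID'].str.split(',', expand=True)
for i in range(ids.shape[1]):
    df['ID' + str(i + 1)] = ids.iloc[:, i]

print(df['ID1'].tolist())  # ['1', '4', '6']
```

Cells without a value in a given position come out as None, which keeps row-wise searching simple.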
I have a dataset with a money column of type StringType(). The values contain abbreviations like K and M.
I would like to remove the 'K' or 'M' and multiply the number by 1000 or 1000000 respectively. I tried creating a function and using it to add a new column to the dataframe. I keep getting the following error:
ValueError: Cannot convert column into bool: please use '&' for 'and',
'|' for 'or', '~' for 'not' when building DataFrame boolean
expressions.
The function I tried is as follows:
def convertall(ReleaseClause):
    if ReleaseClause == None:
        return 0
    elif expr("substring(ReleaseClause,-1,length(ReleaseClause))") == 'K':
        remove_euro = expr("substring(ReleaseClause,2,length(ReleaseClause))")
        remove_K = translate(remove_euro, 'K', '')
        remove_Kint = remove_K.cast(IntegerType()) * lit(1000)
        return remove_Kint
    elif expr("substring(ReleaseClause,-1,length(ReleaseClause))") == 'M':
        remove_euro = expr("substring(ReleaseClause,2,length(ReleaseClause))")
        remove_M = translate(remove_euro, 'M', '')
        remove_Mint = remove_M.cast(IntegerType()) * lit(1000000)
        return remove_Mint
    else:
        return ReleaseClause
The following code converts the data using the F.when() function. First the string is split into characters, then the M/K symbol is extracted, as well as the amount to be multiplied. This solution assumes the string length stays the same and the positions of the M/K symbol and the amount do not vary across rows.
import pyspark.sql.functions as F

data = [("$12.3M",),
        ("$23.4K",),
        ("$12.5M",),
        ("$22.3K",)]
df = spark.createDataFrame(data, schema=["ReleaseClause"])
df_ans = (df
          .select("ReleaseClause",
                  F.split("ReleaseClause", '').alias("split"))
          .withColumn("scale", F.col("split")[5])
          .withColumn("amount",
                      F.concat(F.col("split")[1], F.col("split")[2],
                               F.col("split")[3], F.col("split")[4])
                      .cast("double"))
          .withColumn("scaled",
                      F.when(F.col("scale") == "K", F.col("amount") * 1000)
                       .when(F.col("scale") == "M", F.col("amount") * 1000000)))
This produces output as
You could check whether the value contains K or M, extract the number, and multiply.
Example:
import pyspark.sql.functions as F
from pyspark.sql.types import DoubleType, IntegerType

data = [("$12.3M",), ("$23.4K",), ("$12.5M",), ("$22.3K",)]
df = spark.createDataFrame(data, schema=["ReleaseClause"])
df = df.withColumn(
    "result",
    F.when(
        F.col("ReleaseClause").contains("K"),
        F.regexp_extract(F.col("ReleaseClause"), r"(\d+(\.\d+)?)", 1).cast(DoubleType())
        * 1000,
    )
    .when(
        F.col("ReleaseClause").contains("M"),
        F.regexp_extract(F.col("ReleaseClause"), r"(\d+(\.\d+)?)", 1).cast(DoubleType())
        * 1000000,
    )
    .cast(IntegerType()),
)
Result:
+-------------+--------+
|ReleaseClause|result |
+-------------+--------+
|$12.3M |12300000|
|$23.4K |23400 |
|$12.5M |12500000|
|$22.3K |22300 |
+-------------+--------+
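The extract-and-scale logic can be sanity-checked outside Spark with plain `re` (a sketch; `parse_clause` is a hypothetical helper, not part of the Spark answer):

```python
import re

def parse_clause(s):
    # pull out the numeric part, then scale by the K/M suffix
    amount = float(re.search(r'(\d+(\.\d+)?)', s).group(1))
    if 'K' in s:
        return round(amount * 1_000)
    if 'M' in s:
        return round(amount * 1_000_000)
    return round(amount)

print(parse_clause('$12.3M'))  # 12300000
print(parse_clause('$23.4K'))  # 23400
```

Using round() rather than int() avoids truncation from floating-point representation (e.g. 23.4 * 1000 is 23399.999... as a float).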
For a particular column (dtype = object), how can I add '-' to the start of the string, given that the string ends with '-'?
i.e. convert 'MAY500-' to '-MAY500-'
(I need to apply this to every element in the column)
Try something like this:
# setup
df = pd.DataFrame({'col': ['aaaa', 'bbbb-', 'cc-', 'dddddddd-']})

mask = df.col.str.endswith('-')
df.loc[mask, 'col'] = '-' + df.loc[mask, 'col']
Output
df
col
0 aaaa
1 -bbbb-
2 -cc-
3 -dddddddd-
You can use np.select
Given a dataframe like this:
df
values
0 abcd-
1 a-bcd
2 efg-
You can use np.select as follows:
import numpy as np
df['values'] = np.select([df['values'].str.endswith('-')], ['-' + df['values']], df['values'])
output:
df
values
0 -abcd-
1 a-bcd
2 -efg-
def add_prefix(text):
    # If text is null or an empty string, the -1 index would raise IndexError
    if text and text[-1] == "-":
        return "-" + text
    return text

df = pd.DataFrame(data={'A': ["MAY500", "MAY500-", "", None, np.nan]})
# Change the column to string dtype first
df['A'] = df['A'].astype(str)
df['A'] = df['A'].apply(add_prefix)
0 MAY500
1 -MAY500-
2
3 None
4 nan
Name: A, dtype: object
I have a knack for using apply with lambda functions; it makes the code much easier to read.
df['value'] = df['value'].apply(lambda x: '-'+str(x) if str(x).endswith('-') else x)
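A quick check of that lambda on hypothetical values:

```python
import pandas as pd

df = pd.DataFrame({'value': ['MAY500', 'MAY500-', 'JUN1-']})
# prefix '-' only where the string already ends with '-'
df['value'] = df['value'].apply(lambda x: '-' + str(x) if str(x).endswith('-') else x)
print(df['value'].tolist())  # ['MAY500', '-MAY500-', '-JUN1-']
```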
I have two data frames containing a common variable, 'citation'. I am trying to check if values of citation in one data frame are also values in the other data frame. The problem is that the variables are of different format. In one data frame the variables appear as:
0154/0924
0022/0320
whereas in the other data frame they appear as:
154/ 924
22/ 320
the differences being: 1) no leading zeros before the first non-zero digit of the number before the slash, and 2) zeros that appear after the slash but before the first non-zero digit are replaced with spaces, ' ', in the second data frame.
I am trying to use a function and apply it, as shown in the code below, but I am having trouble with regex and I could not find documentation on this exact problem.
def Clean_citation(citation):
    # Search for an opening bracket in the name followed by
    # any characters repeated any number of times
    if re.search(r'\(.*', citation):
        # Extract the position of the beginning of the pattern
        pos = re.search(r'\(.*', citation).start()
        # return the cleaned name
        return citation[:pos]
    else:
        # if no clean-up is needed, return the same name
        return citation

df['citation'] = df['citation'].apply(Clean_citation)
Aside: maybe relevant: a literal with a leading zero like 01 raises SyntaxError: invalid token in Python 3.
My solution:
def convert_str(strn):
    new_strn = [s.lstrip("0") for s in strn.split('/')]  # strip only leading 0's
    return '/ '.join(new_strn)
So,
convert_str('0154/0924') #would return
'154/ 924'
which is the same format as 'citation' in the other data frame. You could then use pandas' apply to run convert_str on the 'citation' column of the first dataframe.
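An end-to-end sketch of that idea, with assumed frame and column names, normalising the first frame and checking membership in the second:

```python
import pandas as pd

def convert_str(strn):
    new_strn = [s.lstrip("0") for s in strn.split('/')]  # strip only leading 0's
    return '/ '.join(new_strn)

df1 = pd.DataFrame({'citation': ['0154/0924', '0022/0320']})
df2 = pd.DataFrame({'citation': ['154/ 924', '22/ 320']})

# normalise df1's format, then test membership against df2
matches = df1['citation'].apply(convert_str).isin(df2['citation'])
print(matches.tolist())  # [True, True]
```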
Solution
You can use x.str.findall(r'(\d+)'), where x is either the pandas.DataFrame column or a pandas.Series object. You can run this on both columns to extract the true numbers, with each row becoming a list of two numbers, or an empty list if no number is present.
You could then concatenate the numbers into a single string:
num_pair_1 = df1.Values.str.findall('(\d+)')
num_pair_2 = df2.Values.str.findall('(\d+)')
a = num_pair_1.str.join('/') # for first data column
b = num_pair_2.str.join('/') # for second data column
And now finally compare a and b as they should not have any of those additional zeros or spaces.
# for a series s with the values
s.str.strip().str.findall('(\d+)')
# for a column 'Values' in a dataframe df
df.Values.str.findall('(\d+)')
Output
0 []
1 [154, 924]
2 [22, 320]
dtype: object
Data
import sys
if sys.version_info[0] < 3:
from StringIO import StringIO
else:
from io import StringIO
import pandas as pd
ss = """
154/ 924
22/ 320
"""
s = pd.Series(StringIO(ss).read().splitlines())
df = s.str.strip().to_frame('Values')
Output
Values
0
1 154/ 924
2 22/ 320
Here's a pattern that handles both formats:
pattern = r'[0\s]*(\d+)/[0\s]*(\d+)'
s = pd.Series(['0154/0924', '0022/0320', '154/ 924', '22/ 320'])
s.str.extract(pattern)
Output:
0 1
0 154 924
1 22 320
2 154 924
3 22 320
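Putting the pieces together, one option (my assumption about the intended comparison) is to extract with that pattern on both sides, re-join the groups, and compare:

```python
import pandas as pd

pattern = r'[0\s]*(\d+)/[0\s]*(\d+)'
s1 = pd.Series(['0154/0924', '0022/0320'])
s2 = pd.Series(['154/ 924', '22/ 320'])

# extract both numbers, then re-join each row into a canonical 'a/b' string
norm1 = s1.str.extract(pattern).agg('/'.join, axis=1)
norm2 = s2.str.extract(pattern).agg('/'.join, axis=1)
print((norm1 == norm2).tolist())  # [True, True]
```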
Convert the str to a list by str.split('/') and map to int:
int will remove the leading zeros
If the values in the list are different, df1['citation'] == df2['citation'] will compare as False by row
Requires no regular expressions or list comprehensions
DataFrame setup:
df1 = pd.DataFrame({'citation': ['0154/0924', '0022/0320']})
df2 = pd.DataFrame({'citation': ['154/ 924', '22/ 320']})
print(df1)
citation
0154/0924
0022/0320
print(df2)
citation
154/ 924
22/ 320
Split on / and set type to int:
def fix_citation(x):
    return list(map(int, x.split('/')))
df1['citation'] = df1['citation'].apply(fix_citation)
df2['citation'] = df2['citation'].apply(fix_citation)
print(df1)
citation
[154, 924]
[22, 320]
print(df2)
citation
[154, 924]
[22, 320]
Compare the columns:
df1 == df2
I have a dataframe generated from a .csv (I use Python 3.5). The df['category'] column contains only strings. What I want is to check this column and, if a string contains a specific substring (I'm not really interested in where it appears, as long as it exists), replace it. I am using this script:
import pandas as pd
df=pd.read_csv('lastfile.csv')
df.dropna(inplace=True)
g='Drugs'
z='Weapons'
c='Flowers'
df.category = df.category.str.lower().apply(lambda x: g if ('mdma' or 'xanax' or 'kamagra' or 'weed' or 'tabs' or 'lsd' or 'heroin' or 'morphine' or 'hci' or 'cap' or 'mda' or 'hash' or 'kush' or 'wax'or 'klonop'or\
'dextro'or'zepam'or'amphetamine'or'ketamine'or 'speed' or 'xtc' or 'XTC' or 'SPEED' or 'crystal' or 'meth' or 'marijuana' or 'powder' or 'afghan'or'cocaine'or'haze'or'pollen'or\
'sativa'or'indica'or'valium'or'diazepam'or'tablet'or'codeine'or \
'mg' or 'dmt'or'diclazepam'or'zepam'or 'heroin' ) in x else(z if ('weapon'or'milit'or'gun'or'grenades'or'submachine'or'rifle'or'ak47')in x else c) )
print(df['category'])
My problem is that some records though they contain some of the substrings I defined, do not get replaced. Is it a regex related problem?
Thank you in advance.
Create a dictionary mapping each replacement string to its list of substrings, loop over it, and join each list's values with | for a regex OR. You can then check the column with contains and replace the matched rows with loc:
df = pd.DataFrame({'category': ['sss mdma df', 'milit ss aa', 'aa ss']})

a = ['mdma', 'xanax', 'kamagra']
b = ['weapon', 'milit', 'gun']

g = 'Drugs'
z = 'Weapons'
c = 'Flowers'

d = {g: a, z: b}

df['new_category'] = c
for k, v in d.items():
    pat = '|'.join(v)
    mask = df.category.str.contains(pat, case=False)
    df.loc[mask, 'new_category'] = k

print (df)
category new_category
0 sss mdma df Drugs
1 milit ss aa Weapons
2 aa ss Flowers
I am trying to count the number of times that any string from a list_of_strings appears in a csv file cell.
For example, the following would work fine.
import pandas as pd
data_path = "SurveryResponses.csv"
df = pd.read_csv(data_path)
totalCount = 0
for row in df['rowName']:
    if type(row) == str:
        print(row.count('word_of_interest'))
However, I would like to be able to enter a list of strings (['str1', 'str2', 'str3']) rather than just one 'word_of_interest', such that if any of those strings appears, the count value will increase by one.
Is there a way to do this?
Perhaps something along the lines of
totalCount = 0
words_of_interest = ['cat', 'dog', 'foo', 'bar']
for row in df['rowName']:
    if type(row) == str:
        if sum([word in row for word in words_of_interest]) > 0:
            totalCount += 1
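The sketch above can be tried without a CSV by letting a plain list stand in for the column (hypothetical data; the non-string element mimics a NaN cell):

```python
totalCount = 0
words_of_interest = ['cat', 'dog', 'foo', 'bar']
rows = ['the cat sat', 42, 'foo and bar', 'nothing here']  # 42 stands in for a NaN cell

for row in rows:
    if type(row) == str:
        # count the row once if it contains at least one word of interest
        if sum([word in row for word in words_of_interest]) > 0:
            totalCount += 1

print(totalCount)  # 2
```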
Use the str accessor:
df['rowName'].str.count('word_of_interest')
If you need to convert the column to string first, use astype:
df['rowName'].astype(str).str.count('word_of_interest')
Assuming list_of_strings = ['str1', 'str2', 'str3'], you can try the following:
if any(map(lambda x: x in row, list_of_strings)):
    totalCount += 1
You can use this method to count from an external list
strings = ['string1','string2','string3']
sum([1 if sr in strings else 0 for sr in df.rowName])
Here is an example:
import io
filedata = """animal,amount
"['cat','dog']",2
"['cat','horse']",2"""
df = pd.read_csv(io.StringIO(filedata))
Returns this dataframe:
animal amount
0 ['cat','dog'] 2
1 ['cat','horse'] 2
Search for word cat (looping through all columns as series):
search = "cat"
# sums True for each serie and then wrap a sum around all sums
# sum([2,0]) in this case
sum([sum(df[cols].astype(str).str.contains(search)) for cols in df.columns])
Returns 2
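To count several words at once with the same approach, one option (my addition, not part of the original answer) is to join the words with `|` into a regex OR and reuse `str.contains`:

```python
import io
import pandas as pd

filedata = """animal,amount
"['cat','dog']",2
"['cat','horse']",2"""
df = pd.read_csv(io.StringIO(filedata))

words = ['cat', 'horse']
pattern = '|'.join(words)  # regex OR over all words of interest
# sum the per-column True counts across all columns
count = sum(df[col].astype(str).str.contains(pattern).sum() for col in df.columns)
print(count)  # 2
```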