I have two dataframes, where one has an * asterisk wildcard in its string which needs to be compared with a full string, such that the * asterisk matches any number of characters.
The same thing works in Excel VLOOKUP using the formula below; I need the same done with Python / pandas DataFrames. I tried pd.merge, but the * asterisk is treated as a literal character in Python instead of a wildcard.
Left_dataframe    Right_dataframe          Formula in excel (working)    Output expected
Compare * data    Compare with the data    VLOOKUP(A2,B:B,1,0)           Compare with the data
I tried pd.merge on the pandas DataFrames, but the * asterisk is treated as a literal string in Python instead of a wildcard, so the merge does not work.
It is difficult to propose a solution in the absence of a minimal reproducible example, but you might consider the following:
#import libraries
import pandas as pd
import re ##not used initially
#create dataframes
left_data = {'Left_dataframe': 'Compare * data'}
right_data = {'Right_dataframe': 'Compare with the data'}
df_left = pd.DataFrame(left_data, index=[0])
df_right = pd.DataFrame(right_data, index=[0])
#check df
df_left
df_right
# compare cell values; the literal '*' has to be escaped as r'\*' in a regex pattern
mask_has_wildcard = df_left.Left_dataframe.str.contains(r'\*', regex=True)
df_output_where = df_right.Right_dataframe.where(mask_has_wildcard.values, df_left.Left_dataframe)
df_output_where
'''
[Disclaimer] Below might not be the most pythonic way.
The add-on below checks for the '*' asterisk within the first df and, in addition,
checks whether the first 'word' in the two dfs matches.
If so, it returns the value in the second df.
'''
## pattern to use
str_pattern5 = r'(\w+) (.*) (\w+)'
## startswith first word in column (returned by re.match())
## assist from https://stackoverflow.com/a/56073916/20107918
## assist from https://stackoverflow.com/a/61713961/20107918
compare_left = df_left.Left_dataframe.apply(
    lambda x: x.startswith(re.match(str_pattern5, df_left.Left_dataframe.values[0]).group(1)))
compare_right = df_right.Right_dataframe.apply(
    lambda x: x.startswith(re.match(str_pattern5, df_right.Right_dataframe.values[0]).group(1)))
## return value from the right dataframe where
## the right df, compare_right == compare_left
df_output_compare02 = df_right.Right_dataframe if ((df_left.Left_dataframe.str.contains(r'\*')).all() and (compare_right == compare_left).all()) else None
df_output_compare02
#### NB:
#df_output_where_compare = df_right.where(((df_left.Left_dataframe.str.contains('/*')) and (compare_right == compare_left))) ##truth value
#df_output_where_compare = df_right.Right_dataframe.where((df_left.Left_dataframe.str.contains('/*')).all() and (compare_right == compare_left).all(), df_left) #ValueError: Array conditional must be same shape as self
#df_output_where_compare
PS: a plain pd.merge (below) does not return the desired output.
# perform pd.merge
df_output = df_left.merge(df_right, how='outer', left_on='Left_dataframe', right_on='Right_dataframe')
df_output
Further consideration:
If performing a check on * as a pattern, do it as regex.
You may import re for better regex handling.
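For example, one approach (a sketch, not tested beyond the sample row above) is to translate the Excel-style wildcard into a regex with fnmatch.translate and match it against the other column:
import pandas as pd
from fnmatch import translate   # turns shell-style wildcards ('*') into a regex

df_left = pd.DataFrame({'Left_dataframe': ['Compare * data']})
df_right = pd.DataFrame({'Right_dataframe': ['Compare with the data']})

pattern = translate(df_left.Left_dataframe.iloc[0])   # 'Compare * data' -> regex
matched = df_right[df_right.Right_dataframe.str.match(pattern)]
# matched contains 'Compare with the data', which is the expected output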
This seems like it should be pretty simple, but I'm stumped for some reason. I have a list of PySpark columns that I would like to sort by name (including aliasing, as that will be how they are displayed/written to disk). Here's some example tests and things I've tried:
def test_col_sorting():
    from pyspark.sql import SparkSession
    import pyspark.sql.functions as f

    # Active spark context needed
    spark = SparkSession.builder.getOrCreate()

    # Data to sort
    cols = [f.col('c'), f.col('a'), f.col('b').alias('z')]

    # Attempt 1
    result = sorted(cols)
    # This fails with ValueError: Cannot convert column into bool: please use '&' for 'and', '|' for 'or', '~' for 'not' when building DataFrame boolean expressions.

    # Attempt 2
    result = sorted(cols, key=lambda x: x.name())
    # Fails for the same reason, `name()` returns a Column object, not a string

    # Assertion I want to hold true:
    assert result == [f.col('a'), f.col('c'), f.col('b').alias('z')]
Is there any reasonable way to actually get the string back out of the Column object that was used to initialize it (but also respecting aliasing)? If I could get this from the object I could use it as a key.
Note that I am NOT looking to sort the columns on a DataFrame, as answered in this question: Python/pyspark data frame rearrange columns. These Column objects are not bound to any DataFrame. I also do not want to sort the column based on the values of the column.
Answering my own question: it seems that you can't do this without some amount of parsing from the column string representation. You also don't need regex to handle this. These two methods should take care of it:
from typing import List

from pyspark.sql import Column


def get_column_name(col: Column) -> str:
    """
    PySpark doesn't allow you to directly access the column name with respect to aliases
    from an unbound column. We have to parse this out from the string representation.

    This works on columns with one or more aliases as well as unaliased columns.

    Returns:
        Col name as str, with respect to aliasing
    """
    c = str(col).lstrip("Column<'").rstrip("'>")
    return c.split(' AS ')[-1]


def sorted_columns(cols: List[Column]) -> List[Column]:
    """
    Returns sorted list of columns, with respect to aliases

    Args:
        cols: List of PySpark Columns (e.g. [f.col('a'), f.col('b').alias('c'), ...])

    Returns:
        Sorted list of PySpark Columns by name, with respect to aliasing
    """
    return sorted(cols, key=lambda x: get_column_name(x))
Some tests to validate behavior:
import pytest
import pyspark.sql.functions as f
from pyspark.sql import SparkSession


@pytest.fixture(scope="session")
def spark() -> SparkSession:
    # Provide a session spark fixture for all tests
    yield SparkSession.builder.getOrCreate()


def test_get_col_name(spark: SparkSession):
    col = f.col('a')
    actual = get_column_name(col)
    assert actual == 'a'


def test_get_col_name_alias(spark: SparkSession):
    col = f.col('a').alias('b')
    actual = get_column_name(col)
    assert actual == 'b'


def test_get_col_name_multiple_alias(spark: SparkSession):
    col = f.col('a').alias('b').alias('c')
    actual = get_column_name(col)
    assert actual == 'c'


def test_sorted_columns(spark: SparkSession):
    cols = [f.col('z').alias('c'), f.col('a'), f.col('d').alias('e').alias('f'), f.col('b')]
    actual = sorted_columns(cols)
    expected = [f.col('a'), f.col('b'), f.col('z').alias('c'), f.col('d').alias('e').alias('f')]
    # We can't directly compare lists of cols, so we zip and check the repr of each element
    for a, b in zip(actual, expected):
        assert str(a) == str(b)
I think it's fair to say being unable to access this information in a truthy way is a failure of the PySpark API. There are a multitude of valid reasons to want to ascertain what name an unbound Column type would be resolved to, and it should not have to be parsed in such a hacky way.
If you're only interested in grabbing the column names and sorting those (without any relation to any data), you can use the column object's __repr__ method and use regex to extract the actual name of your column.
So for these columns
import pyspark.sql.functions as f
cols = [f.col('c'), f.col('a'), f.col('b').alias('z')]
You could do this:
import re
# Making a list of string representation of our columns
col_repr = [x.__repr__() for x in cols]
["Column<'c'>", "Column<'a'>", "Column<'b AS z'>"]
# Using regex to extract the interesting part of the column name
# while making sure we're properly grabbing the alias name. Notice
# that we're grabbing the right part of the column name in `b AS z`
col_names = [re.search(r"([a-zA-Z]+)'>", x).group(1) for x in col_repr]
['c', 'a', 'z']
# Sorting this array
sorted_col_names = sorted(col_names)
['a', 'c', 'z']
NOTE: This example is simple (only accepting lowercase and uppercase letters as column names) but as your column names get more complex, it's just a question of adapting your regex pattern.
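For instance, a slightly more general pattern (my own assumption, not part of the original answer) grabs everything between the quotes and then keeps the part after the last ' AS ', which also copes with multi-character or underscored names:
import re

col_repr = ["Column<'my_col'>", "Column<'first_name AS fn'>"]
col_names = [re.search(r"'(.*)'>$", x).group(1).split(' AS ')[-1] for x in col_repr]
# ['my_col', 'fn']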
In my DF there are values like الÙجيرة in different columns. How can I remove such values? I am reading the data from an Excel file, so if something can be done at read time, that would be great.
Also, I have some values like Battery ÁÁÁ that I want to become just Battery. How can I delete these non-English characters but keep the rest of the content?
You can use regex to remove designated characters from your strings:
import re
import pandas as pd
records = [{'name':'Foo الÙجيرة'}, {'name':'Battery ÁÁÁ'}]
df = pd.DataFrame.from_records(records)
# Allow alphanumeric characters and spaces (add additional characters as needed)
pattern = re.compile('[^A-Za-z0-9 ]+')

def clean_text(string):
    # substitute disallowed characters with nothing, then trim leftover whitespace
    return pattern.sub('', string).strip()
# Apply to your df
df['clean_name'] = df['name'].apply(clean_text)
          name clean_name
0  Foo الÙجيرة        Foo
1  Battery ÁÁÁ    Battery
For more solutions, you can read this SO Q: Python, remove all non-alphabet chars from string
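Another option (a sketch using pandas' built-in string methods, not taken from the answer above) is to round-trip through ASCII encoding so that anything non-ASCII is simply dropped:
import pandas as pd

df = pd.DataFrame({'name': ['Foo الÙجيرة', 'Battery ÁÁÁ']})
# encode to ASCII, silently dropping non-ASCII bytes, then decode back to str
df['clean_name'] = (df['name'].str.encode('ascii', errors='ignore')
                              .str.decode('ascii')
                              .str.strip())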
You can use the Python split method to do that, or you can use a lambda function:
df[column_name] = df[column_name].apply(lambda column_name : column_name[start:stop])
#df['location'] = df['location'].apply(lambda location:location[0:4])
Split Method
df[column_name] = df[column_name].apply(lambda column_name: column_name.split(' ')[0])
I am trying to generate a third column in a pandas dataframe using two other columns in the dataframe. The requirement is very particular to my scenario.
The requirement is stated as:
let the dataframe name be df, the first column be 'first_name', and the second column be 'last_name'.
I need to generate the third column in such a manner that it uses string formatting to generate a particular string, passes that string to a function, and uses whatever the function returns as the value of the third column.
Problem 1
base_string = "my name is {first} {last}"
df['summary'] = base_string.format(first=df['first_name'], last=df['last_name'])
Problem 2
df['summary'] = some_func(base_string.format(first=df['first_name'], last=df['last_name']))
My ultimate goal is to solve problem 2, but problem 1 is a prerequisite for it and as of now I'm unable to solve that. I have tried converting my dataframe values to strings, but it is not working the way I expected.
You can use apply:
df['summary'] = df.apply(lambda r: base_string.format(first=r['first_name'], last=r['last_name']),
                         axis=1)
Or list comprehension:
df['summary'] = [base_string.format(first=x, last=y)
                 for x, y in zip(df['first_name'], df['last_name'])]
And then, for general function some_func:
df['summary'] = [some_func(base_string.format(first=x, last=y))
                 for x, y in zip(df['first_name'], df['last_name'])]
You could use pandas.DataFrame.apply with axis=1, so your code will look like this:
def mapping_function(row):
    # build the formatted string from the row and pass it through your function
    return some_func(base_string.format(first=row['first_name'], last=row['last_name']))

df['summary'] = df.apply(mapping_function, axis=1)
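As a quick end-to-end sketch (str.upper here is only a stand-in for the real some_func, and the sample names are made up):
import pandas as pd

base_string = "my name is {first} {last}"
some_func = str.upper  # placeholder for the actual function

df = pd.DataFrame({'first_name': ['Ada', 'Alan'], 'last_name': ['Lovelace', 'Turing']})

def mapping_function(row):
    return some_func(base_string.format(first=row['first_name'], last=row['last_name']))

df['summary'] = df.apply(mapping_function, axis=1)
print(df['summary'].tolist())
# ['MY NAME IS ADA LOVELACE', 'MY NAME IS ALAN TURING']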
I'm trying to import an Excel file and search for a specific record.
Here's what I have come up with so far, which keeps throwing an error.
The Excel spreadsheet has two columns, Keyword and Description; each keyword is around 10 characters max, and each description is around 150 characters max.
I can print the whole sheet in the Excel file without any errors using print(df1), but as soon as I try to search for a specific value it errors out.
Error
ValueError: ('Lengths must match to compare', (33,), (1,))
Code
import pandas as pd
file = 'Methods.xlsx'
df = pd.ExcelFile(file)
df1 = df.parse('Keywords')
lookup = df1['Description'].where(df1['Keyword']==["as"])
print (lookup)
The filter syntax is like this:
df_filtered = df[df[COLUMN]==KEYWORD]
So in your case it'd be (your original line fails because ==["as"] compares the column against a one-element list, which is what raises 'Lengths must match to compare'; compare against the scalar "as" instead):
lookup = df1[df1['Keyword'] == "as"]['Description']
or the whole code
import pandas as pd
file = 'Methods.xlsx'
df = pd.ExcelFile(file)
df1 = df.parse('Keywords')
lookup = df1[df1['Keyword'] == "as"]['Description']
print (lookup)
Breaking it down:
is_keyword = df1['Keyword'] == "as"
This returns a Series containing True or False depending on whether the keyword is present.
Then we can filter the dataframe to get the rows where it is True:
df_filtered = df1[is_keyword]
This results in all the columns, so to get only the Description column:
lookup = df_filtered['Description']
or in one line
lookup = df1[df1['Keyword'] == "as"]['Description']
Adding to the elaborate answer given by @Jimmar above:
Just for syntactical convenience, you could write the code like this:
lookup = df1[df1.Keyword == "as"].Description
Pandas provides column name lookup as if it were a member of the DataFrame class (dot notation). Please note that for this to work the column names must not contain any spaces.
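If a column name does contain a space (or clashes with a DataFrame attribute), fall back to bracket access; for example, with a hypothetical column named "Key word":
lookup = df1[df1["Key word"] == "as"]["Description"]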
An accounting format for numeric values usually uses a currency character, and often uses parentheses to represent negative values. Zero may also be represented as a - or $-. When such a series is imported into a Pandas DataFrame it is an object type. I need to convert it to a numeric type and parse the negative values correctly.
Here's an example:
import numpy as np
import pandas as pd
from pandas import Series, DataFrame
df = pd.DataFrame({'A':['123.4', '234.5', '345.5', '456.7'],
'B':['$123.4', '$234.5', '$345.5', '$456.7'],
'C':['($123.4)', '$234.5', '($345.5)', '$456.7'],
'D':['$123.4', '($234.5)', '$-', '$456.7']})
Series A is easy to convert e.g.
df['A'] = df['A'].astype(float)
Series B requires the removal of the $ sign, after which it is straightforward.
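For example (a one-line sketch, not part of the original question):
df['B'] = df['B'].str.replace('$', '', regex=False).astype(float)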
Then come series C and D. They contain parenthesized (i.e. negative) values, and D contains $- for zero. How can I correctly parse these series into a numeric series / dataframe?
import numpy as np
import pandas as pd

def pd_columntonumbeR(df, colname):
    for c in colname:
        df[c] = np.vectorize(replacetonumbeR)(df[c])
        df[c].fillna(0, inplace=True)
        df[c] = pd.to_numeric(df[c])

def replacetonumbeR(s):
    if isinstance(s, str):
        # drop the thousands separator and currency symbol first,
        # so that both '-' and '$-' end up meaning zero
        s = s.strip().replace(",", "").replace("$", "")
        if s == "-" or s == "":
            s = 0
        elif s.find("(") >= 0 and s.find(")") >= 0:
            # accounting-style parentheses mean a negative number
            s = s.replace("(", "-").replace(")", "")
    return s
I'd use the Pandas replace function to replace $ and ) with nothing, replace - with 0, and then finally replace ( with -. Then you can do df = df.astype(float) and it should work.
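A minimal sketch of that approach, applied to the question's example frame (the exact regex patterns are my own assumption, not the answerer's code):
import pandas as pd

df = pd.DataFrame({'C': ['($123.4)', '$234.5', '($345.5)', '$456.7'],
                   'D': ['$123.4', '($234.5)', '$-', '$456.7']})

cleaned = (df.replace(r'[\$)]', '', regex=True)   # drop '$' and ')'
             .replace(r'^-$', '0', regex=True)    # a bare '-' (from '$-') means zero
             .replace(r'\(', '-', regex=True)     # '(' marks a negative value
             .astype(float))
print(cleaned)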