I'm working on a small code generation app that loads in an Excel file (using pandas ExcelFile + xlrd) which is then parsed to a dataframe (ExcelFile.parse) for several SQL-like operations. The stored data is then returned to a file writer as a list using map and lambda functions with a little f-string formatting on the specific fields.
The problem I'm having is that not all fields in the Excel file are predictably populated, so I'm using fillna('') while parsing to the dataframe. However, when I reach the f-string, the unpopulated fields cause an error when I apply :.0f formatting to remove the decimals. If I don't use fillna(''), the floats format correctly, but I then end up with multiple entries of nan as a string value that I can't work out how to convert to ''.
As an example, the code below will fail with fillna('') because NumField3 and NumField4 can be empty in the source spreadsheet.
return list(
    map(
        lambda row: f"EXEC ***_****_*.****_Register_File("
                    f"{row['NumField1']:.0f},{row['NumField2']:.0f},"
                    f"'{row['TextField1']}','{row['TextField2']}',"
                    f"'{row['TextField3']}','{row['TextField4']}',"
                    f"{row['NumField3']:.0f},{row['NumField4']:.0f});\n",
        df.to_dict("records")))
My original approach was using .format() and itertuples(), but this was apparently a less efficient way. I've opted for the conversion to dictionary so I can retain the field names in the list construction for easier supportability.
I'm probably missing something really simple, but I can't see the wood for the trees at the moment. Any suggestions?
I think I've worked it out. I've removed fillna('') from the parsing of the ExcelFile object to a dataframe, so unpopulated fields keep their NaN value. When the dataframe records are processed through the map/lambda approach, each NaN is rendered as the string 'nan', so I've added a re.sub that matches that value as a whole word and replaces it with the required empty string.
It's not pretty but it works.
# Note: requires `import re` at the top of the module. The whole-word 'nan'
# left behind by unpopulated numeric fields is replaced with an empty string.
return list(
    re.sub(r'\bnan\b', '', i) for i in map(
        lambda row: f"EXEC ***_****_*.****_Register_File("
                    f"{row['NumField1']:.0f},{row['NumField2']:.0f},"
                    f"'{row['TextField1']}','{row['TextField2']}',"
                    f"'{row['TextField3']}','{row['TextField4']}',"
                    f"{row['NumField3']:.0f},{row['NumField4']:.0f});\n",
        df.to_dict("records")))
I'm trying to find a solution for stripping blank spaces from some strings in my DataFrame. I found this solution, where someone said this:
I agree with the other answers that there's no inplace parameter for the strip function, as seen in the documentation for str.strip.
To add to that: I've found the str functions for pandas Series are usually used when selecting specific rows, like df[df['Name'].str.contains('69')]. I'd say this is a possible reason that it doesn't have an inplace parameter -- it's not meant to be completely "stand-alone" like rename or drop.
Also to add: I think a more pythonic solution is to use negative indices instead:
data['Name'] = data['Name'].str.strip().str[-5:]
This way, we don't have to assume that there are 18 characters, and we'll consistently get the last 5 characters instead.
So, I have a list of DataFrames called 'dataframes'. In the first dataframe (which is dataframes[0]), I have a column named 'cnj' with string values, some of them with a blank space at the end. For example:
Input:
dataframes[0]['cnj'][9]
Output:
'0100758-73.2019.5.01.0064 '
So, following the comment above, I did this:
Input:
dataframes[0]['cnj'] = dataframes[0]['cnj'].strip()
Then I get the following error:
AttributeError: 'Series' object has no attribute 'strip'
Since the solution given in the other topic worked, what am I doing wrong to get this error? It seemed to me it shouldn't work because it's a Series, but it should give the same result as the one mentioned above (data['Name'] = data['Name'].str.strip().str[-5:]), right?
Use
dataframes[0]['cnj'] = dataframes[0]['cnj'].str.strip()
or better yet, store the dataframe in a variable first:
df0 = dataframes[0]
df0['cnj'] = df0['cnj'].str.strip()
The code in the solution you posted uses the .str accessor:
data['Name'] = data['Name'].str.strip().str[-5:]
The pandas Series object has no string or date manipulation methods. These are exposed through the Series.str and Series.dt accessor objects.
The result of Series.str.strip() is a new Series. That's why .str[-5:] is needed to retrieve the last 5 characters. That results in a new Series again. The expression is equivalent to:
temp_series = data['Name'].str.strip()
data['Name'] = temp_series.str[-5:]
You could just apply a transformation function on the column values like this.
data["Name"] = data["Name"].apply(lambda x: str(x).strip()[-5:])
What you need, if I understand your question correctly, is to remove the trailing spaces from strings in a Series (or a dataframe). Use str.rstrip(), which works on Series objects via the .str accessor (and can be applied column by column to a dataframe).
Note: strip() is only for string data types, so the error you are getting is expected.
Refer to the link, and try implementing str.rstrip() as provided by pandas.
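A one-line sketch of that, reusing the 'cnj' column and the list of dataframes from the question (the names are taken from the question, not from this answer):

dataframes[0]['cnj'] = dataframes[0]['cnj'].str.rstrip()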
For str.strip() you can refer to this link; it works for me.
In your case, assuming the dataframe column is s, you can use the code below:
df[s].str.strip()
I want to concatenate two columns in pandas containing mostly string values and some missing values. The result should be a new column which again contains string values and missing values. Mostly it just worked fine with this:
df['newcolumn']=df['column1']+df['column2']
Most of the values in column1 are numbers (interpreted as strings) like 82. But some of the values in column2 are a composition of letters and numbers starting with an E, like E52 or E83. When 82 and E83 are concatenated, the result I want is 82E83. Unfortunately the result then is 8,2E+84. I guess Python implicitly interpreted this as a number in scientific notation.
I already tried different ways of concatenating and forcing string format, but the result is always the same:
df['newcolumn'] = (df['column1'] + df['column2']).astype(str)
or
df['newcolumn'] = (df['column1'].str.cat(df['column2'])).astype(str)
It seems Python first creates a float, producing this unwanted format, and then changes the type to string, keeping results like 8,2E+84. Is there a solution for strictly keeping the string format?
Edit: Thanks for your comments. When I tried to reproduce the problem myself with a very short dataframe, the problem didn't occur either. Finally I realized that it was only Excel automatically interpreting the cells as (wrong) numbers in the CSV output. I didn't notice it before, because another dataframe coming from a CSV file, which I used for merging with this dataframe on the concatenated strings, had also already been "destroyed" the same way by Excel. So the merging didn't work properly and I thought the concatenation in Python was the problem. I used to view the dataframe with Excel because it is really big. In the future I will be more careful with this. My apologies for misplacing the problem!
Type conversion is not required in this case. You can simply use
df["newcolumn"] = df.apply(lambda x: f"{str(x[0])}{str(x[1])}", axis = 1)
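For reference, a minimal alternative sketch (using the column1/column2 names from the question): casting both columns to str before a plain + also keeps the result as text. Note that, unlike plain +, astype(str) turns missing values into the literal string 'nan', so this only fits if the columns are fully populated or cleaned first.

# Cast both columns to string before concatenating, so values
# like '82' and 'E83' stay as the plain string '82E83'.
df["newcolumn"] = df["column1"].astype(str) + df["column2"].astype(str)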
How can I parse phone numbers from a pandas data frame, ideally using phonenumbers library?
I am trying to use a port of Google's libphonenumber library on Python,
https://pypi.org/project/phonenumbers/.
I have a data frame with 3 million phone numbers from many countries. I have a column with the phone number and a column with the country/region code. I'm trying to use the parse function in the package. My goal is to parse each row using the corresponding country code, but I can't find a way of doing it efficiently.
I tried using apply but it didn't work. I get a "(0) Missing or invalid default region." error, meaning it won't pass the country code string.
df['phone_number_clean'] = df.phone_number.apply(lambda x:
phonenumbers.parse(str(df.phone_number),str(df.region_code)))
The line below works, but doesn't get me what I want, as the numbers I have come from about 120+ different countries.
df['phone_number_clean'] = df.phone_number.apply(lambda x:
phonenumbers.parse(str(df.phone_number),"US"))
I tried doing this in a loop, but it is terribly slow. Took me more than an hour to parse 10,000 numbers, and I have about 300x that:
for i in range(n):
    df3['phone_number_std'][i] = phonenumbers.parse(str(df.phone_number[i]), str(df.region_code[i]))
Is there a method I'm missing that could run this faster? The apply function works acceptably well but I'm unable to pass the data frame element into it.
I'm still a beginner in Python, so perhaps this has an easy solution. But I would greatly appreciate your help.
Your initial solution using apply is actually pretty close - you don't say what doesn't work about it, but the syntax for a lambda function over multiple columns of a dataframe, rather than on the rows within a single column, is a bit different. Try this:
df['phone_number_clean'] = df.apply(lambda x:
phonenumbers.parse(str(x.phone_number),
str(x.region_code)),
axis='columns')
The differences:
You want to include multiple columns in your lambda function, so you want to apply your lambda function to the entire dataframe (i.e, df.apply) rather than to the Series (the single column) that is returned by doing df.phone_number.apply. (print the output of df.phone_number to the console - what is returned is all the information that your lambda function will be given).
The argument axis='columns' (or axis=1, which is equivalent, see the docs) actually slices the data frame by rows, so apply 'sees' one record at a time (ie, [index0, phonenumber0, countrycode0], [index1, phonenumber1, countrycode1]...) as opposed to slicing the other direction, which would give it ([phonenumber0, phonenumber1, phonenumber2...])
Your lambda function only knows about the placeholder x, which, in this case, is the Series [index0, phonenumber0, countrycode0], so you need to specify all the values relative to the x that it knows - i.e., x.phone_number, x.region_code.
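As a tiny illustration of that slicing (a hypothetical two-row frame, just to show what x holds on each call):

import pandas as pd

demo = pd.DataFrame({"phone_number": ["0612345678", "2025550123"],
                     "region_code": ["NL", "US"]})
# With axis='columns' the lambda receives one row at a time as a Series,
# so x.phone_number and x.region_code are the values for that record.
demo.apply(lambda x: print(x.phone_number, x.region_code), axis="columns")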
I love @katelie's solution, but here's my code. I added a try/except block so that format_number failing doesn't break the whole run; it cannot handle strings that are too long.
import phonenumbers as phon

def formatE164(number):
    # Parse with "NL" as the default region and format to E.164;
    # return None for values that cannot be parsed (e.g. overly long strings).
    try:
        return phon.format_number(phon.parse(str(number), "NL"), phon.PhoneNumberFormat.E164)
    except Exception:
        return None

df['column'] = df['column'].apply(formatE164)
I have a dataset in a CSV file in which one of the columns contains a list (or dict, which further includes several semicolons and commas because of the key/value pairs). The trouble is that accessing it with pandas returns mixed-up values, because the list contains several commas even though it is in fact a single column.
I have seen several solutions, such as using "" or ; as the delimiter, but the problem is that I already have the data, and a find-and-replace would completely change my dataset.
An example of the CSV is:
data_column1, data_column2, [{key1:value1},{key2:value2}], data_column3
Please advise on a faster way to access specific columns of the data without any ambiguity.
You can only set the delimiter to one character, so you can't use square brackets in this way. You would need to quote the field with a single character such as " so that the parser knows to ignore the commas between the quote characters.
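As a rough sketch (reusing the example row from the question): if the field that holds the list can be wrapped in double quotes when the CSV is produced, pandas' default quoting already keeps the embedded commas inside a single column.

import io
import pandas as pd

# Hypothetical CSV text where the list/dict field is wrapped in quotes.
csv_text = 'data_column1,data_column2,"[{key1:value1},{key2:value2}]",data_column3\n'
df = pd.read_csv(io.StringIO(csv_text), header=None)
print(df.iloc[0, 2])  # -> [{key1:value1},{key2:value2}]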
You can try converting the column using the melt function. Here is the link to the documentation:
https://pandas.pydata.org/pandas-docs/version/0.23.4/generated/pandas.melt.html
I am fetching a column from a Dataframe. The column is of string type.
x = "[{somevalues, id:1, name:'xyz'}, {address:Some Value}, {somevalue}]" & so on..
The data is stored as a string. It can be easily represented as a list.
I want the output to be:
LIST of [
{somevalues, id:1, name:'xyz'},
{address:Some Value},
{somevalue}
]
How can I achieve this using Spark's API? I know that with Python I can use the eval(x) function and it will return the list or I can use the x.split() function, which will also return a list. However, in this approach, it needs to iterate for each record.
Also, I want to use mapPartitions; that is the reason I need my string column to be a list, so that I can pass it to mapPartitions.
Is there an efficient way where I can also convert my string data using spark API or would mapPartitions be even better as I will be looping every partition rather than every record?
You can use regexp_replace to remove the square brackets and then split on the comma. At first, I'd thought you'd need to do something special to avoid splitting on the commas within the curly brackets, but it seems Spark SQL automatically avoids that. For example, the following query in Zeppelin
%sql
select split(regexp_replace("[{somevalues, id:1, name:'xyz'}, {address:Some Value}, {somevalue}]", "[\\[\\] ]", ""), ",")
gives me
WrappedArray({somevalues, id:1, name:'xyz'}, {address:SomeValue}, {somevalue})
which is what you want.
You can use withColumn to add a column in this way if you're working with dataframes. And if, for some reason, the comma within the curly brackets is being split on, you can do more regex-foo as in this post - Regex: match only outside parenthesis (so that the text isn't split within parentheses)?
Hope that makes sense. I'm not sure if you're using dataframes, but they're recommended over the lower-level RDD API.
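For illustration, here is a minimal PySpark sketch of the same expression via withColumn (the dataframe df and the input column name "raw" are assumed for the example, not taken from the question):

from pyspark.sql import functions as F

# Same idea as the Zeppelin SQL above: strip the square brackets and spaces,
# then split the remaining text on commas into an array column.
df2 = df.withColumn(
    "parsed",
    F.split(F.regexp_replace(F.col("raw"), r"[\[\] ]", ""), ","))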
If you don't want to go to dataframes, then you can use regex replace and split functions on the RDD data you created.
If you have data as
x = "[{somevalues, id:1, name:'xyz'}, {address:Some Value}, {somevalue}]"
Then you can create an RDD and use regex replace and split functions as
import re
# Strip brackets/whitespace, insert ';&;' between the curly-brace groups, then split on it.
rdd = sc.parallelize([x]).flatMap(
    lambda s: re.sub(r"},\{", "};&;{", re.sub(r"[\[\]\s+]", "", s)).split(";&;"))
flatMap is used so that the split data comes out as separate rows:
{somevalues,id:1,name:'xyz'}
{address:SomeValue}
{somevalue}
I hope the answer is helpful.
Note: if you want the solution the dataframe way, you can get ideas from my other answer.