I'm using pandas to load an Excel spreadsheet that contains zip codes (e.g. 32771). The zip codes are stored as 5-digit strings in the spreadsheet. When they are pulled into a DataFrame using the commands...
xls = pd.ExcelFile("5-Digit-Zip-Codes.xlsx")
dfz = xls.parse('Zip Codes')
they are converted into numbers. So '00501' becomes 501.
So my questions are, how do I:
a. Load the DataFrame and keep the string type of the zip codes stored in the Excel file?
b. Convert the numbers in the DataFrame into a five digit string e.g. "501" becomes "00501"?
As a workaround, you could convert the ints to 0-padded strings of length 5 using Series.str.zfill:
df['zipcode'] = df['zipcode'].astype(str).str.zfill(5)
Demo:
import pandas as pd
df = pd.DataFrame({'zipcode':['00501']})
df.to_excel('/tmp/out.xlsx')
xl = pd.ExcelFile('/tmp/out.xlsx')
df = xl.parse('Sheet1')
df['zipcode'] = df['zipcode'].astype(str).str.zfill(5)
print(df)
yields
zipcode
0 00501
You can avoid pandas' type inference with a custom converter, e.g. if 'zipcode' is the header of the column with the zip codes:
dfz = xls.parse('Zip Codes', converters={'zipcode': lambda x:x})
This is arguably a bug, since the column was originally stored as strings. I have opened an issue about it.
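A minimal sketch of the same idea using dtype instead of a converter. It reads an in-memory CSV with pd.read_csv purely so that it runs without an Excel file (that substitution, and the sample data, are assumptions for illustration); pd.read_excel accepts the same dtype= and converters= arguments. Keep in mind that this only preserves leading zeros when the cells really are stored as text: a cell stored as the number 501 has no zeros left to recover.

```python
import io

import pandas as pd

# Hypothetical sample data standing in for the 'Zip Codes' sheet.
csv_text = "zipcode\n00501\n32771\n"

# Default behaviour: pandas infers an integer column and drops the leading zeros.
inferred = pd.read_csv(io.StringIO(csv_text))

# Declaring the column as str skips type inference for that column.
as_text = pd.read_csv(io.StringIO(csv_text), dtype={'zipcode': str})

print(inferred['zipcode'].tolist())
print(as_text['zipcode'].tolist())
```

With an Excel file the call would look like pd.read_excel("5-Digit-Zip-Codes.xlsx", sheet_name='Zip Codes', dtype={'zipcode': str}), assuming the column header is 'zipcode'.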
str(my_zip).zfill(5)
or
print("{0:>05s}".format(str(my_zip)))
are two of the many ways to do this.
The previous answers have correctly suggested using zfill(5). However, if your zip codes are already stored as floats for some reason (I recently encountered data like this), you first need to convert them to int. Otherwise astype(str) produces strings such as '11.0', and padding those does not give a valid zip code. Then you can use zfill(5).
df = pd.DataFrame({'zipcode':[11.0, 11013.0]})
zipcode
0 11.0
1 11013.0
df['zipcode'] = df['zipcode'].astype(int).astype(str).str.zfill(5)
zipcode
0 00011
1 11013
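If the float column also contains missing values, astype(int) raises an error on the NaN entries. The following is a small sketch of one way around that, assuming missing zip codes should stay missing; the helper name format_zip and the sample data are made up for illustration.

```python
import pandas as pd


def format_zip(value):
    """Return a zero-padded 5-character zip code, or None for missing values."""
    if pd.isna(value):
        return None
    # int() drops the trailing '.0' of a float such as 11.0 before padding.
    return str(int(value)).zfill(5)


df = pd.DataFrame({'zipcode': [11.0, 11013.0, float('nan')]})
df['zipcode'] = df['zipcode'].map(format_zip)
print(df)
```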
I need to read a worksheet and format numbers that are stored without punctuation.
For example, I need this number: 12345678901234567890 to look like this: 1234567-89.0123.4.56.7890
In JS I did something like this in a Google Sheets spreadsheet and it worked, but I need to embed this in a Python script. I'm using pandas to read the spreadsheet, and I'm having trouble writing the equivalent in Python.
function reorder() {
  const a = ["12345678901234567890","12345678901234567890"]; // put all of the strings into an array
  const b = a.map(s => {
    return `${s.slice(0,7)}-${s.slice(7,9)}.${s.slice(9,13)}.${s.slice(13,14)}.${s.slice(14,16)}.${s.slice(16)}`
  })
}
Considering that the dataframe that OP is reading from excel looks like this
df = pd.DataFrame({'string': ['12345678901234567890', '12345678901234567890']})
[Out]:
string
0 12345678901234567890
1 12345678901234567890
There are various ways to achieve OP's goal.
One, for example, is using pandas.Series.apply with a lambda function as follows
df['string'] = df['string'].apply(lambda x: f"{x[0:7]}-{x[7:9]}.{x[9:13]}.{x[13:14]}.{x[14:16]}.{x[16:]}")
[Out]:
string
0 1234567-89.0123.4.56.7890
1 1234567-89.0123.4.56.7890
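The slicing above assumes every value is exactly 20 digits long. As a hedged alternative, here is a sketch that uses a regular expression and leaves any value that does not match untouched; the names PATTERN and format_number are hypothetical.

```python
import re

# Six digit groups of widths 7, 2, 4, 1, 2 and 4 (20 digits in total).
PATTERN = re.compile(r'(\d{7})(\d{2})(\d{4})(\d)(\d{2})(\d{4})')


def format_number(raw):
    """Insert the punctuation, or return the text unchanged if it is not 20 digits."""
    text = str(raw).strip()
    match = PATTERN.fullmatch(text)
    if match is None:
        return text
    return '{}-{}.{}.{}.{}.{}'.format(*match.groups())


formatted = [format_number(s) for s in ['12345678901234567890', '123']]
print(formatted)
```

On a DataFrame this could be applied with df['string'] = df['string'].map(format_number).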
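Afterwards, one can use pandas.DataFrame.query to check that every row now matches the target format. Note that query only filters rows; it does not reformat them, so its result should not be assigned back to the column
df.query(r'string.str.contains(r"\d{7}-\d{2}\.\d{4}\.\d\.\d{2}\.\d{4}")', engine='python')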
[Out]:
string
0 1234567-89.0123.4.56.7890
1 1234567-89.0123.4.56.7890
Notes:
One might have to adjust the column name (in this case it is string) and/or the dataframe (in this case it is df).
I have a dataframe where one column contains numbers but as string values like "1.0", "52.0" etc.
I want to convert the column to instead contain strings like "PRE_1", "PRE_52".
Example
df = pd.DataFrame([['1.0'],['52.0']],columns=['pre'])
df["pre"] = 'PRE_' + df["pre"].astype(str)
gives me output of PRE_1.0
I tried:
df["pre"] = 'PRE_' + df["pre"].astype(int).astype(str) but got a ValueError.
Do I need to convert it into something else before trying to convert it to an int?
It looks like: df["pre"].astype(float).astype(int).astype(str) might do what I want but I'm open to cleaner ways of doing it.
I'm pretty new to pandas, so help would be greatly appreciated!
To properly be able to help, having sample data would be great. Based on the information you did provide, if the data coming in is a float, you can apply a format to truncate it as below.
df = pd.DataFrame({'pre': [1.0, 52.0]})
df['pre'] = df['pre'].map('PRE_{:.0f}'.format)
print(df)
Apply a function:
import pandas as pd
df = pd.DataFrame([['1.0'],['52.0']],columns=['Pre'])
print(df)
df.Pre = df.Pre.apply(lambda n: f'PRE_{float(n):.0f}')
print(df)
Output:
Pre
0 1.0
1 52.0
Pre
0 PRE_1
1 PRE_52
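A vectorized variant of the same conversion, shown as a sketch: it assumes every string in the column parses as a whole number such as '1.0', and it uses the lowercase column name 'pre' from the question.

```python
import pandas as pd

df = pd.DataFrame({'pre': ['1.0', '52.0']})

# Parse the strings as numbers, drop the fractional part, then add the prefix.
df['pre'] = 'PRE_' + pd.to_numeric(df['pre']).astype(int).astype(str)
print(df)
```

This avoids the ValueError from the question because pd.to_numeric parses '1.0' as a float first, whereas astype(int) on the raw strings cannot.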
I have a dataframe with one column named "metadata" in unicode format, as it can be seen below:
print(df.metadata[1])
u'{"vehicle_year":2010,"issue_state":"RS",...,"type":4}'
type(df.metadata[1])
unicode
I have another column in this dataframe named 'issue_state_update', and for each row I need to replace the 'issue_state' value inside the metadata with the value from that row's 'issue_state_update' column.
I have tried to use the following:
for i in range(len(df_final['metadata'])):
    df_final['metadata'][i] = json.loads((df_final['metadata'][i]))
    json.dumps(df_final['metadata'][i].update({'issue_state': df_final['issue_state_update'][i]}),ensure_ascii=False).encode('utf-8')
However what I get is an error:
TypeError: expected string or buffer
What I need is to have exactly the same format as before doing this change, but with the new info associated with 'issue_state'
For example:
u'{"vehicle_year":2010,"issue_state":"NO STATE",...,"type":4}'
I'm assuming you have a DataFrame (DF) that looks something like:
[image: screenshot of a DF I mocked up]
Since you're working with a DF you should manipulate the data as a vector instead of iterating over it like in standard Python. One way to do this is by defining a function and then "applying" it to your data. Something like:
def parse_dict(x):
    # Assumes x['metadata'] is already a dict, not a JSON string
    x['metadata']['issue_state'] = x['issue_state_update']
    return x
Then you could apply it to every row in your DataFrame using:
some_df.apply(parse_dict, axis=1)
After running that code I get an updated DF that looks like:
[image: updated DF where the dict now has the value from 'issue_state_update']
Actually I have found the answer. I don't know how efficient it is, but it works. Here it goes:
def replacer(df):
    df_final = df
    import unicodedata
    df_final['issue_state_upd'] = ""
    for i in range(len(df_final['issue_state'])):
        #From unicode to string
        df_final['issue_state_upd'][i] = unicodedata.normalize('NFKD', df_final['issue_state'][i]).encode('ascii','ignore')
        #From string to dict
        df_final['issue_state_upd'][i] = json.loads((df_final['issue_state_upd'][i]))
        #Replace value in fuel key
        df_final['issue_state_upd'][i].update({'fuel_type': df_final['issue_state_upd'][i]})
        #From dict to str
        df_final['issue_state_upd'][i] = json.dumps(df_final['issue_state_upd'][i])
        #From str to unicode
        df_final['issue_state_upd'][i] = unicode(df_final['issue_state_upd'][i], "utf-8")
    return df_final
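For comparison, here is a shorter sketch of the round trip the question asks for, assuming Python 3 (where str is already Unicode) and the column names 'metadata' and 'issue_state_update' from the question. The helper name set_issue_state and the sample row are made up; the fields elided in the question are simply left out.

```python
import json

import pandas as pd


def set_issue_state(metadata_json, new_state):
    """Parse the JSON text, overwrite 'issue_state', and serialize it again."""
    record = json.loads(metadata_json)
    record['issue_state'] = new_state
    # Compact separators keep the output in the same style as the input.
    return json.dumps(record, ensure_ascii=False, separators=(',', ':'))


df = pd.DataFrame({
    'metadata': ['{"vehicle_year":2010,"issue_state":"RS","type":4}'],
    'issue_state_update': ['NO STATE'],
})

df['metadata'] = [
    set_issue_state(meta, state)
    for meta, state in zip(df['metadata'], df['issue_state_update'])
]
print(df['metadata'].iloc[0])
```

Because dict.update returns None, passing its result straight to json.dumps (as in the question) serializes None rather than the updated dict; assigning the key first and then dumping the dict avoids that.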
I have a csv file with two formatted columns that currently read in as objects:
The first contains percentage values, which read in as strings like '0.01%'. The % is always at the end.
The second contains currency values, which read in as strings like '$1234.5'.
I have tried using the split function to remove the % or $ inside the dataframe, then using float on the result of the split. This prints the correct result but does not assign the value. It also gives a type error saying that float does not have a split function, even though I do the split before the float.
Try this:
import pandas as pd
df = pd.read_csv('data.csv')
"""
The example df looks like this:
col1 col2
0 3.04% $100.25
1 0.15% $1250
2 0.22% $322
3 1.30% $956
4 0.49% $621
"""
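# Keep the part before the '%' and the part after the '$'
df['col1'] = df['col1'].str.split('%', expand=True)[0]
df['col2'] = df['col2'].str.split('$', n=1, expand=True)[1]
# Convert both cleaned columns from strings to numbers
df[['col1', 'col2']] = df[['col1', 'col2']].apply(pd.to_numeric)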
You are probably looking for the apply method, and you need to assign its result back to the column:
df['first_col'] = df['first_col'].apply(lambda x: float(x.strip('%')))
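Both answers can also be written with the vectorized string methods, which avoids a Python-level lambda. This is a sketch under the assumption that the columns are named 'col1' and 'col2' and contain no thousands separators; the sample values are taken from the example above.

```python
import pandas as pd

df = pd.DataFrame({
    'col1': ['3.04%', '0.15%'],
    'col2': ['$100.25', '$1250'],
})

# Strip the trailing '%' and the leading '$', then parse what is left as numbers.
df['col1'] = pd.to_numeric(df['col1'].str.rstrip('%'))
df['col2'] = pd.to_numeric(df['col2'].str.lstrip('$'))
print(df.dtypes)
```

The percentages are kept as written (3.04 rather than 0.0304); divide by 100 afterwards if fractions are needed.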
Is there a simple way to drop rows containing a non-integer cell value, convert the remaining strings to integers, and then sort ascending? I have a dataset (a single column of what are supposed to be just record numbers) that contains strings I want to remove. This code seems to work, but the sort then behaves as if the floats were strings. For example, the record numbers are sorted like so:
0
1
2
200000000
201
3
Code:
import pandas
with open('GridExport.csv') as incsv:
    df1 = pandas.read_csv(incsv, usecols=['Record Number'])
cln = pandas.DataFrame()
cln['Record Number'] = [x for x in df1['Record Number'] if x.isdigit()]
cln.astype(float)
print(cln.sort(['Record Number']))
Is there a way to do this without converting to float first? I'd like to drop the numbers that don't fit into int64
The problem in your code is that the line
cln['Record Number'].astype(float)
does not modify the data frame; astype returns a new object and the result is discarded. Consequently, the column is still of type string and is sorted accordingly. Printing cln['Record Number'].dtype after the statement should make this clear.
If you would like to modify it, you should do the assignment
cln['Record Number'] = cln['Record Number'].astype(float)
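Alternatively, you can keep the elements as strings and sort them by their numeric value with a comparison function. In Python 3, sorted no longer accepts cmp=, so the function has to be wrapped with functools.cmp_to_key
from functools import cmp_to_key
def numeric_compare(x, y):
    return float(x) - float(y)
>>> sorted(['10.0','2000.0','30.0'], key=cmp_to_key(numeric_compare))
['10.0', '30.0', '2000.0']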
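To address the original request without going through float at all, the following sketch parses each value with Python's arbitrary-precision int, drops anything that is not a plain digit string or does not fit into int64, and then sorts. The helper name parse_record_number and the sample values are made up for illustration.

```python
import pandas as pd

INT64_MAX = 2**63 - 1


def parse_record_number(raw):
    """Return the value as an int, or None if it is not a digit string that fits int64."""
    text = str(raw).strip()
    if not (text.isascii() and text.isdigit()):
        return None
    value = int(text)
    return value if value <= INT64_MAX else None


# Hypothetical stand-in for the 'Record Number' column of GridExport.csv.
raw = ['3', '201', 'abc', '0', '200000000', '2', '1', '99999999999999999999999']

parsed = (parse_record_number(x) for x in raw)
values = [v for v in parsed if v is not None]

cln = pd.DataFrame({'Record Number': pd.Series(values, dtype='int64')})
cln = cln.sort_values('Record Number').reset_index(drop=True)
print(cln)
```

Building the column with an explicit int64 dtype means the values are compared as integers, so 201 sorts before 200000000.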