Convert these Objects to int64 in python columns

Another simple question. I have to clean up some data, and a few of the columns need to be in int64 format instead of the objects they are now (example provided). How would I go about uniformly re-formatting these columns?
print(data.Result)
0    98.8 PG/ML H
1    8.20000
2    26.8 PG/ML H
3    40.8 PG/ML H
4    CREDIT
5    15.30000

You could parse with regex:
import re

def parse_int(s):
    """
    A fast memoized function which builds a lookup dictionary, then maps values to the series.
    """
    map_dict = {x: float(re.findall('[0-9.]+', x)[0]) for x in s.unique() if re.search('[0-9.]+', x)}
    return s.map(map_dict)

data['Result'] = parse_int(data['Result'])
The function above takes all the unique values from the series and pairs each with its float equivalent. This is an extremely efficient approach when values repeat. The function then maps these value pairs (map_dict) onto the original series (s).
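For reference, here is roughly what the sample column should look like after the mapping (a sketch, not verified output; 'CREDIT' contains no digits, so it never enters map_dict and s.map turns it into NaN):

print(data['Result'])
# 0    98.8
# 1     8.2
# 2    26.8
# 3    40.8
# 4     NaN
# 5    15.3
# Name: Result, dtype: float64

Note that the result is float64 rather than int64, because of the decimals and the NaN. If you truly need an integer dtype that tolerates missing values, one option (assuming pandas >= 0.24) is data['Result'].round().astype('Int64').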

Related

Re-arrange 1D pandas DataFrame to 2d by splitting index names

I have a 1D DataFrame that is indexed with keys of the form i_n, where i and n are strings (for the sake of this example, i is an integer number and n is a character). Here is a simple example:
values
0_a 0.583772
1_a 0.782358
2_a 0.766844
3_a 0.072565
4_a 0.576667
0_b 0.503876
1_b 0.352815
2_b 0.512834
3_b 0.070908
4_b 0.074875
0_c 0.361226
1_c 0.526089
2_c 0.299183
3_c 0.895878
4_c 0.874512
Now I would like to re-arrange this DataFrame to be 2D such that the number (the part of the index name before the underscore) serves as column name and the character (the part of the index after the underscore) serves as index:
0 1 2 3 4
a 0.583772 0.782358 0.766844 0.0725654 0.576667
b 0.503876 0.352815 0.512834 0.0709081 0.0748752
c 0.361226 0.526089 0.299183 0.895878 0.874512
I have a solution for the problem (the function convert_2d below), but I was wondering whether there would be a more idiomatic way to achieve this. Here is the code that was used to generate the original DataFrame and to convert it to the desired form:
import pandas as pd
import numpy as np
def convert_2d(df):
    # note: the column keys must be strings, to match the parts produced by split('_')
    df2 = pd.DataFrame(columns=['a', 'b', 'c'], index=[str(i) for i in range(5)]).T
    names = set(idx.split('_')[1] for idx in df.index)
    numbers = set(idx.split('_')[0] for idx in df.index)
    for i in numbers:
        for n in names:
            df2[i][n] = df['values']['{}_{}'.format(i, n)]
    return df2
##generating 1d example data:
data = np.random.rand(15)
indices = ['{}_{}'.format(i,n) for n in ['a','b','c'] for i in range(5)]
df = pd.DataFrame(
    data, columns=['values']
).rename(index={i: idx for i, idx in enumerate(indices)})
print(df)
##converting to 2d
print(convert_2d(df))
Some notes about the index keys: it can be assumed (as in my function) that there are no 'missing keys' (i.e. a 2D array can always be achieved), and the only thing that can be taken for granted about the keys is the (single) underscore (i.e. the numbers and letters were chosen only for explanatory reasons; in reality there would just be two arbitrary strings connected by the underscore).
IIUC, create a MultiIndex, then unstack:
df.index = pd.MultiIndex.from_tuples(df.index.str.split('_').map(tuple))
df['values'].unstack(level=0)
Out[65]:
0 1 2 3 4
a 0.583772 0.782358 0.766844 0.072565 0.576667
b 0.503876 0.352815 0.512834 0.070908 0.074875
c 0.361226 0.526089 0.299183 0.895878 0.874512
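For clarity, a sketch of the intermediate step: on the original string index, str.split('_') yields lists and map(tuple) turns them into tuples that from_tuples can consume; unstack(level=0) then pivots the first level (the numbers) into columns.

df.index.str.split('_').map(tuple)[:3]
# a sketch of the expected result:
# Index([('0', 'a'), ('1', 'a'), ('2', 'a')], dtype='object')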

I need to create a python list object, or any object, out of a pandas DataFrame object grouping pieces of values from different rows

My DataFrame has a string in the first column, and a number in the second one:
GEOSTRING IDactivity
9 wydm2p01uk0fd2z 2
10 wydm86pg6r3jyrg 2
11 wydm2p01uk0fd2z 2
12 wydm80xfxm9j22v 2
39 wydm9w92j538xze 4
40 wydm8km72gbyuvf 4
41 wydm86pg6r3jyrg 4
42 wydm8mzt874p1v5 4
43 wydm8mzmpz5gkt8 5
44 wydm86pg6r3jyrg 5
45 wydm8w1q8bjfpcj 5
46 wydm8w1q8bjfpcj 5
What I want to do is manipulate this DataFrame so that I end up with a list object that contains one string per distinct "IDactivity" value, each made out of the 5th character of every "GEOSTRING" value in that group.
So in this case, I have 3 different "IDactivity" values, and my list object will contain 3 strings that look like this:
['2828', '9888','8888']
where, again, the symbols you see in each string are the 5th character of each "GEOSTRING" value.
What I'm asking for is a solution, or an approach, that doesn't involve an overly complicated for loop and is as efficient as possible, since I have to manipulate lots of data. I'd like it to be clean and fast.
I hope it's clear enough.
This can be done easily as a one-liner (considered to be pretty fast, too):
result = df.groupby('IDactivity')['GEOSTRING'].apply(lambda x:''.join(x.str[4])).tolist()
This groups the dataframe by the values of IDactivity, then selects from each corresponding GEOSTRING string its 5th character (index 4) and joins these characters together. Finally, the tolist() method returns the output as a list instead of a pandas Series.
output:
['2828', '9888', '8888']
Documentation:
pandas.groupby
pandas.apply
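To see what the key intermediate step produces, .str[4] extracts the 5th character of each string; a sketch on the sample data (output assumed, not re-run):

df['GEOSTRING'].str[4].tolist()
# ['2', '8', '2', '8', '9', '8', '8', '8', '8', '8', '8', '8']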
Here's a solution involving a temp column, and taking inspiration for the key operation from this answer:
# create a temp column with the character we want from each string
dframe['Temp'] = dframe['GEOSTRING'].apply(lambda x: x[4])
# groupby ID and then concatenate using a sneaky call to .sum()
dframe.groupby('IDactivity')['Temp'].sum().tolist()
Result:
['2828', '9888', '8888']
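The "sneaky" part works because summing a Series of strings concatenates them, the same way Python's + operator does on strings; a minimal illustration:

import pandas as pd
pd.Series(['2', '8', '2', '8']).sum()   # returns '2828'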

Pandas really slow join

I am trying to merge 2 dataframes, where I want to use the most recent date's row. Note that the date is not sorted, so it is not possible to use groupby.first() or groupby.last().
Left DataFrame (n=834,570) | Right DataFrame (n=1,592,005)
id_key                     | id_key  date        other_vars
1                          | 1       2015-07-06  ...
2                          | 1       2015-07-07  ...
3                          | 1       2014-04-04  ...
Using the groupby/agg example, it takes 8 minutes! When I convert the dates to integers, it takes 6 minutes.
gb = right.groupby('id_key')
gb.agg(lambda x: x.iloc[x.date.argmax()])
I used my own version, where I make a dictionary per id in which I store the date and index of the highest date seen so far. You just iterate over the whole data once, ending up with a dictionary {id_key : (highest_date, index)}.
This way, it is really fast to find just the rows necessary.
It only takes 6 seconds to end up with the merged data; about an 85x speedup.
I have to admit I'm very surprised, as I thought pandas would be optimised for this. Does anyone have an idea what is going on, and whether the dictionary method should also be an option in pandas? It would also be simple to adapt this to other conditions, of course, like sum, min, etc.
My code:
# 1. Create dictionary
dc = {}
for ind, (ik, d) in enumerate(zip(right['id_key'], right['date'])):
    if ik not in dc:
        dc[ik] = (d, ind)
        continue
    if (d, ind) > dc[ik]:
        dc[ik] = (d, ind)

# 2. Collect the indices all at once (repeated subsetting was slow), so we only subset once.
# It has the same number of rows as left.
inds = []
for x in left['id_key']:
    # if the key is missing, reuse the last value that was given (missing strategy; very, very few)
    if x in dc:
        row = dc[x][1]
    inds.append(row)

# 3. Take the values
result = right.iloc[inds]
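For comparison, a sketch of a vectorized approach that stays inside pandas (an assumption on my part, not benchmarked on this data; it relies on right having a unique default index and on 'date' being a proper datetime or otherwise comparable column):

# idxmax per group returns the row label of each id_key's latest date
latest = right.loc[right.groupby('id_key')['date'].idxmax()]
result = left.merge(latest, on='id_key', how='left')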

changing height (feet and inches) to an integer in python pandas

I have a pandas dataframe that contains height information and I can't seem to figure out how to convert the somewhat unstructured information into an integer.
I figured the best way to approach this was to use regex, but the main problem I'm having is that when I attempt to simplify the problem, I usually take the first item in the dataframe (7' 5.5") and try to use regex specifically on it. It seemed impossible to put this data in a string because of the quotes. So I'm really confused about how to approach this problem.
here is my dataframe:
HeightNoShoes HeightShoes
0 7' 5.5" NaN
1 6' 11" 7' 0.25"
2 6' 7.75" 6' 9"
3 6' 5.5" 6' 6.75"
4 5' 11" 6' 0"
Output should be in inches:
HeightNoShoes HeightShoes
0 89.5 NaN
1 83 84.25
2 79.75 81
3 77.5 78.75
4 71 72
My next option would be writing this to CSV and using Excel, but I would prefer to learn how to do it in python/pandas. Any help would be greatly appreciated.
The previous answer is a good solution to the problem that doesn't use regular expressions. I will post this in case you are curious about how to approach the problem using your first idea (regexes).
It is possible to solve this using a regular expression. In order to put data such as 7' 5.5" into a string in Python, you can escape the quote.
For example:
py_str = "7' 5.5\""
This, combined with a regular expression, will allow you to extract the information you need from the input data to calculate the output data. The input data consists of an integer (feet) followed by ', a space, and then a floating point number (inches). This float consists of one or more digits and then, optionally, a . and more digits. Here is a regular expression that can extract the feet and inches from the input data: ([0-9]+)' ([0-9]*\.?[0-9]+)"
The first group of the regex retrieves the feet and the second retrieves the inches. Here is an example of a function in python that returns a float, in inches, based on input data such as "7' 5.5\"", or NaN if there is no valid match:
Code:
import re

r = re.compile(r"([0-9]+)' ([0-9]*\.?[0-9]+)\"")

def get_inches(el):
    m = r.match(el)
    if m is None:
        return float('NaN')
    return int(m.group(1)) * 12 + float(m.group(2))
Example:
>>> get_inches("7' 5.5\"")
89.5
You could apply that regular expression to the elements in the data. However, the solution of mapping your own function over the data works well; I just thought you might want to see how you could approach this using your original idea.
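For completeness, a sketch of applying it to the whole dataframe (an assumption: the NaN in HeightShoes is a real missing value rather than a string, so it is guarded with pd.isna before matching):

import pandas as pd

for col in ['HeightNoShoes', 'HeightShoes']:
    df[col] = df[col].apply(lambda el: el if pd.isna(el) else get_inches(el))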
One possible method without using regex is to write your own function and just apply it to the column/Series of your choosing.
Code:
import pandas as pd

df = pd.read_csv("test.csv")

def parse_ht(ht):
    # expected format: 7' 0.0"
    ht_ = ht.split("' ")
    ft_ = float(ht_[0])
    in_ = float(ht_[1].replace("\"", ""))
    return (12 * ft_) + in_

print(df["HeightNoShoes"].apply(parse_ht))
Output:
0 89.50
1 83.00
2 79.75
3 77.50
4 71.00
Name: HeightNoShoes, dtype: float64
Not perfectly elegant, but it does the job with minimal fuss. Best of all, it's easy to tweak and understand.
Comparison versus the accepted solution:
In [9]: import re
In [10]: r = re.compile(r"([0-9]+)' ([0-9]*\.?[0-9]+)\"")
    ...: def get_inches2(el):
    ...:     m = r.match(el)
    ...:     if m == None:
    ...:         return float('NaN')
    ...:     else:
    ...:         return int(m.group(1))*12 + float(m.group(2))
    ...:
In [11]: %timeit get_inches("7' 5.5\"")
100000 loops, best of 3: 3.51 µs per loop
In [12]: %timeit parse_ht("7' 5.5\"")
1000000 loops, best of 3: 1.24 µs per loop
parse_ht is a little more than twice as fast.
First create the dataframe of height values
Let's first set up a Pandas dataframe to match the question, then convert the values shown in feet and inches to a numerical value using apply. NOTE: the questioner asks if the values can be converted to integers; however, the first value in the 'HeightNoShoes' column is 7' 5.5". Since this string value is expressed in half inches, it will first be converted to a float value. You can then use the round function to round it before typecasting the values as integers.
# libraries
import numpy as np
import pandas as pd

# height data
no_shoes = ['''7' 5.5"''',
            '''6' 11"''',
            '''6' 7.75"''',
            '''6' 5.5" ''',
            '''5' 11"''']

shoes = [np.nan,
         '''7' 0.25"''',
         '''6' 9"''',
         '''6' 6.75"''',
         '''6' 0"''']

# put height data into a Pandas dataframe
height_data = pd.DataFrame({'HeightNoShoes': no_shoes, 'HeightShoes': shoes})
height_data.head()
Next use a function to convert feet to float values
Here is a function that converts a height given in feet and inches to a single float value in inches.
def feet_to_float(cell_string):
    try:
        split_strings = cell_string.replace('"', '').replace("'", '').split()
        # total inches = feet * 12 + inches
        float_value = float(split_strings[0]) * 12 + float(split_strings[1])
    except:
        float_value = np.nan
    return float_value
Next, apply the function to each column in the dataframe.
# obtain a copy of the height data
df = height_data.copy()
for col in df.columns:
    print(col)
    df[col] = df[col].apply(feet_to_float)
df.head()
Here is a function to convert the float values to integers, keeping the NaN values in the Pandas column.
If you would like to convert the dataframe to integer values with a NaN value in one column, you can use the following function and code. Note that the function rounds the values before typecasting them as integers; typecasting the float values as integers before rounding would just truncate them.
def float_to_int(cell_value):
    try:
        return int(round(cell_value, 0))
    except:
        return cell_value

for col in df.columns:
    df[col] = df[col].apply(float_to_int)
Note: Pandas displays columns that contain both NaN values and integers as float values.
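As an aside (assuming pandas >= 0.24, which may postdate the original answer), the nullable integer dtype can hold both integers and missing values, avoiding the float display:

# sketch: 'Int64' (capital I) is pandas' nullable integer dtype
df['HeightShoes'] = df['HeightShoes'].round().astype('Int64')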
Here is the code to convert a single column in the dataframe to a numerical value.
df = height_data.copy()
df['HeightNoShoes'] = df['HeightNoShoes'].apply(feet_to_float)
df.head()
This is how to convert the single column of float values to integers. Note that it's important to round the values first; typecasting the values as integers before rounding would incorrectly truncate them.
df['HeightNoShoes'] = round(df['HeightNoShoes'],0).astype(int)
df.head()
There are NaN values in the second Pandas column labeled 'HeightShoes'. Both the feet_to_float and float_to_int functions found above should be able to handle these.
df = height_data.copy()
df['HeightShoes'] = df['HeightShoes'].apply(feet_to_float)
df['HeightShoes'] = df['HeightShoes'].apply(float_to_int)
df.head()
This may also serve the purpose (note that it converts the heights to centimeters rather than inches):
def inch_to_cm(x):
    if x is np.nan:
        return x
    ft, inc = x.split("'")
    inches = inc[1:-1]
    # the inches part can be fractional (e.g. 5.5), so parse it as float
    return ((12 * int(ft)) + float(inches)) * 2.54
df['Height'] = df['Height'].apply(inch_to_cm)
Here is a way using str.extract():
(df.stack()
   .str.extract(r"(\d+)' (\d+\.?\d*)")
   .rename({0: 'feet', 1: 'inches'}, axis=1)
   .astype(float)
   .assign(feet=lambda x: x['feet'].mul(12))
   .sum(axis=1)
   .unstack())
Output:
HeightNoShoes HeightShoes
0 89.50 NaN
1 83.00 84.25
2 79.75 81.00
3 77.50 78.75
4 71.00 72.00
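A note on the design (my reading, not the original author's): stack() folds both height columns into one long Series so that a single str.extract pass handles them, the feet are multiplied by 12 and added to the inches with sum(axis=1), and unstack() restores the original two-column layout at the end.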

Creating a New Pandas Grouped Object

In some transformations, I seem to be forced to break away from the grouped Pandas dataframe object, and I would like a way to return to that object.
Given a dataframe of time series data, if one groups by one of the values in the dataframe, we are given an underlying dictionary from key to dataframe.
Being forced to make a Python dict from this, the structure cannot be converted back into a DataFrame using .from_dict(), because the structure maps keys to dataframes.
The only way to go back to Pandas without some hacky column renaming is, to my knowledge, to convert it back into a grouped object.
Is there any way to do this?
If not, how would I convert a dictionary mapping keys to dataframes back into a Pandas data structure?
EDIT ADDING SAMPLE::
import numpy as np
import pandas as pd

rng = pd.date_range('1/1/2000', periods=10, freq='10m')
df = pd.DataFrame({'a': pd.Series(np.random.randn(len(rng)), index=rng),
                   'b': pd.Series(np.random.randn(len(rng)), index=rng)})
# now we have a dataframe with 'a's and 'b's in a time series

df_dict = {}
for k, v in df.groupby('a'):
    df_dict[k] = v
# now we apply some transformation that cannot be applied via aggregate, transform, or apply
# how do we get this back into a grouped-by object?
If I understand OP's question correctly, you want to group a dataframe by some key(s), do different operations on each group (possibly generating new columns, etc.), and then go back to the original dataframe.
Modifying your example (grouping by random integers instead of floats, which are usually unique):
np.random.seed(200)
rng = pd.date_range('1/1/2000', periods=10, freq='10m')
df = pd.DataFrame({'a':pd.Series(np.random.randn(len(rng)), index=rng), 'b':pd.Series(np.random.randn(len(rng)), index=rng)})
df['group'] = np.random.randint(3,size=(len(df)))
Usually, if I need a single value per column for each group, I'll do this (for example, sum of 'a', mean of 'b'):
In [10]: df.groupby('group').aggregate({'a':np.sum, 'b':np.mean})
Out[10]:
a b
group
0 -0.214635 -0.319007
1 0.711879 0.213481
2 1.111395 1.042313
[3 rows x 2 columns]
However, if I need a series for each group,
In [19]: def func(sub_df):
   ....:     sub_df['c'] = sub_df['a'] * sub_df['b'].shift(1)
   ....:     return sub_df
   ....:
In [20]: df.groupby('group').apply(func)
Out[20]:
a b group c
2000-01-31 -1.450948 0.073249 0 NaN
2000-11-30 1.910953 1.303286 2 NaN
2001-09-30 0.711879 0.213481 1 NaN
2002-07-31 -0.247738 1.017349 2 -0.322874
2003-05-31 0.361466 1.911712 2 0.367737
2004-03-31 -0.032950 -0.529672 0 -0.002414
2005-01-31 -0.221347 1.842135 2 -0.423151
2005-11-30 0.477257 -1.057235 0 -0.252789
2006-09-30 -0.691939 -0.862916 2 -1.274646
2007-07-31 0.792006 0.237631 0 -0.837336
[10 rows x 4 columns]
I'm guessing you want something like the second example, but the original question wasn't very clear, even with your example.
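If the underlying goal is to get from a dict of dataframes back to a grouped object, one hedged sketch (assuming every value in df_dict is a dataframe with the same columns) is to concatenate with the dict keys as an extra index level and group on that level:

# pd.concat on a dict uses the keys as the outer index level;
# grouping on that level recovers a GroupBy object
combined = pd.concat(df_dict)
grouped = combined.groupby(level=0)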
