How to add a column to Pandas based off of other columns - python

I'm using Pandas and I have a very basic dataframe:
session_id datetime
5 t0ubmqqpbt01rhce201cujjtm7 2014-11-28T04:30:09Z
6 k87akpjpl004nbmhf4loiafi72 2014-11-28T04:30:11Z
7 g0t7hrqo8hgc5vlb7240d1n9l5 2014-11-28T04:30:12Z
8 ugh3fkskmedq3br99d20t78gb2 2014-11-28T04:30:15Z
9 fckkf16ahoe1uf9998eou1plc2 2014-11-28T04:30:18Z
I wish to add a third column based on the values of the current columns:
df['key'] = urlsafe_b64encode(md5('l' + df['session_id'] + df['datetime']))
But I receive:
TypeError: must be convertible to a buffer, not Series

You need to use pandas.DataFrame.apply. The code below will apply the lambda function to each row of df. You could, of course, define a separate function if you need to do something more complicated.
import pandas as pd
from io import StringIO
from base64 import urlsafe_b64encode
from hashlib import md5
s = ''' session_id datetime
5 t0ubmqqpbt01rhce201cujjtm7 2014-11-28T04:30:09Z
6 k87akpjpl004nbmhf4loiafi72 2014-11-28T04:30:11Z
7 g0t7hrqo8hgc5vlb7240d1n9l5 2014-11-28T04:30:12Z
8 ugh3fkskmedq3br99d20t78gb2 2014-11-28T04:30:15Z
9 fckkf16ahoe1uf9998eou1plc2 2014-11-28T04:30:18Z'''
df = pd.read_csv(StringIO(s), sep='\s+')
df['key'] = df.apply(lambda x: urlsafe_b64encode(md5('l' + x['session_id'] + x['datetime'])), axis=1)
Note: I couldn't get the hashing bit working on my machine, unfortunately (some unicode error, possibly because I'm using Python 3), and I don't have time to debug the inner workings of it, but I'm pretty sure about the pandas part :P
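For what it's worth, here is a minimal Python 3-friendly sketch of the same idea (untested against the original data): md5() wants bytes, so the concatenated string is encoded first, and urlsafe_b64encode() wants the raw digest rather than the hash object:
df['key'] = df.apply(
    lambda x: urlsafe_b64encode(
        md5(('l' + x['session_id'] + x['datetime']).encode('utf-8')).digest()
    ).decode('ascii'),
    axis=1
)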

Related

Filter Data with Pandas in PyCharm

This is my code so far in PyCharm for my Streamlit data app:
import pandas as pd
import plotly.express as px
import streamlit as st

st.set_page_config(page_title='Matching Application Number',
                   layout='wide')

df = pd.read_csv('Analysis_1.csv')

st.sidebar.header("Filter Data:")
MeetingFileType = st.sidebar.multiselect(
    "Select File Type:",
    options=df['MEETING_FILE_TYPE'].unique(),
    default=df['MEETING_FILE_TYPE'].unique()
)

df_selection = df.query(
    'MEETING_FILE_TYPE == @MeetingFileType'
)
st.dataframe(df_selection)
The result of my code on streamlit is this below:
Application_ID MEETING_FILE_TYPE
BBC#:1010 1
NBA#:1111 2
BRC#:1212 1
SAC#:1412 4
QRD#:1912 2
BBA#:1092 4
But, I would like to filter the data and only return Application_ID results for just MEETING_FILE_TYPE 1&2.
I am looking for this result below:
(Sidebar filter "Select File Type:" set to 1, 2)

Application_ID  MEETING_FILE_TYPE
BBC#:1010       1
NBA#:1111       2
BRC#:1212       1
QRD#:1912       2
The .isin() function is useful for creating a vector of Bools on which to filter the rows of your DataFrame. For filtering categorical columns, as in your example, it's the simplest way to go. Documentation here.
select_type = [1,2]
df = df[df['MEETING_FILE_TYPE'].isin(select_type)]
It's not necessary in this example, but pairing ~ (which inverts the boolean values) with .isin() often comes in handy. Someone covered this here.
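Applied to the Streamlit example above, a minimal sketch (assuming MeetingFileType holds the values picked in the sidebar multiselect) would be:
# Keep only the rows whose MEETING_FILE_TYPE was selected in the sidebar
df_selection = df[df['MEETING_FILE_TYPE'].isin(MeetingFileType)]
st.dataframe(df_selection)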

Problems Sorting Data out of a text-file

I have a csv file imported into a dataframe and have trouble sorting the data.
df looks like this:
Data
0 <WindSpeed>0.69</WindSpeed>
1 <PowerOutput>0</PowerOutput>
2 <ThrustCoEfficient>0</ThrustCoEffici...
3 <RotorSpeed>8.17</RotorSpeed>
4 <ReactivePower>0</ReactivePower>
5 </DataPoint>
6 <DataPoint>
7 <WindSpeed>0.87</WindSpeed>
8 <PowerOutput>0</PowerOutput
I want it to look like this:
0 Windspeed Poweroutput
1 0.69 0.0
HereĀ“s the code that I wrote so far:
import pandas as pd
from pandas.compat import StringIO
import re
import numpy as np

df = pd.read_csv('powercurve.csv', encoding='utf-8', skiprows=42)
df.columns = ['Data']
no_of_rows = df.Data.str.count("WindSpeed").sum() / 2
rows = no_of_rows.astype(np.uint32)
TRBX = pd.DataFrame(index=range(0, abs(rows)),
                    columns=['WSpd[m/s]', 'Power[kW]'], dtype='float')
i = 0
for i in range(len(df)):
    if 'WindSpeed' in df['Data']:
        TRBX['WSpd[m/s]', i] = re.findall("'(\d+)'", 'Data')
    elif 'Rotorspeed' in df['Data']:
        TRBX['WSpd[m/s]', i] = re.findall("'(\d+)'", 'Data')
Is this a suitable approach? If so, why are no values being written into the TRBX dataframe so far? Where is my mistake?
The code below should help you if your df is indeed in the format you show:
import re
import pandas as pd

split_func = lambda x: re.split('<|>', str(x))
split_series = df.Data.apply(split_func)
data = split_series.apply(lambda x: x[2]).rename('data')
features = split_series.apply(lambda x: x[1]).rename('features')
df = pd.DataFrame(data).set_index(features).T
You may want to drop some columns that have no data or input some N/A values afterwards. You also may want to rename the variables and series to different names that make more sense to you.
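For illustration, here is a self-contained sketch of that approach on made-up rows mimicking the question's Data column (the sample values and variable names are hypothetical):
import re
import pandas as pd

df = pd.DataFrame({'Data': ['<WindSpeed>0.69</WindSpeed>',
                            '<PowerOutput>0</PowerOutput>',
                            '<RotorSpeed>8.17</RotorSpeed>']})

split_series = df.Data.apply(lambda x: re.split('<|>', str(x)))
data = split_series.apply(lambda x: x[2]).rename('data')          # tag values
features = split_series.apply(lambda x: x[1]).rename('features')  # tag names
wide = pd.DataFrame(data).set_index(features).T
print(wide)  # one row of values, with WindSpeed / PowerOutput / RotorSpeed as columns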

How does Pandas.read_csv type casting work?

Using pandas.read_csv with parse_dates option and a custom date parser, I find Pandas has a mind of its own about the data type it's reading.
Sample csv:
"birth_date", "name"
"","Dr. Who"
"1625", "Rembrandt"
"1533", "Michel"
The actual datecleaner is here, but what I do boils down to this:
import pandas as pd

def dateclean(date):
    return str(int(date))  # Note: we return A STRING

df = pd.read_csv(
    'my.csv',
    parse_dates=['birth_date'],
    date_parser=dateclean,
    engine='python'
)
print(df.birth_date)
Output:
0 NaN
1 1625.0
2 1533.0
Name: birth_date, dtype: float64
I get dtype float64, even though I specified str. Also, if I take out the first data line in the CSV (the one with the empty birth_date), I get int. The workaround is easy:
return '"{}"'.format(int(date))
Is there a better way?
In data analysis, I can imagine it's useful that Pandas will say 'Hey dude, you thought you were reading strings, but in fact they're numbers'. But what's the rationale for overruling me when I tell it not to?
Using parse_dates / date_parser looks a bit complicated to me, unless you want to generalise your import over many date columns. I think you have more control with the converters parameter, where you can plug in your dateclean() function. You can also experiment with the dtype parameter.
The problem with the original dateclean() function is that it fails on the "" value, because int("") raises ValueError. Pandas seems to fall back to its standard import behaviour when it hits this problem, whereas with converters it will fail explicitly.
Below is the code to demonstrate a fix:
import pandas as pd
from pathlib import Path

doc = """"birth_date", "name"
"","Dr. Who"
"1625", "Rembrandt"
"1533", "Michel"
"""
Path('my.csv').write_text(doc)

def dateclean(date):
    try:
        return str(int(date))
    except ValueError:
        return ''

df = pd.read_csv(
    'my.csv',
    parse_dates=['birth_date'],
    date_parser=dateclean,
    engine='python'
)

df2 = pd.read_csv(
    'my.csv',
    converters={'birth_date': dateclean}
)
print(df2.birth_date)
Hope it helps.
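As a side note on the dtype suggestion above, a minimal sketch (not from the original answer) of forcing the column to stay a string:
# Assumption: dtype=str keeps '1625'/'1533' as strings; the empty field
# still comes through as NaN unless you post-process with fillna('').
df3 = pd.read_csv('my.csv', dtype={'birth_date': str})
print(df3.birth_date.fillna(''))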
The problem is that date_parser is designed specifically for conversion to datetime:
date_parser : function, default None. Function to use for converting a sequence of string columns to an array of datetime instances.
There is no reason you should expect this parameter to work for other types. Instead, you can use the converters parameter. Here we use toolz.compose to apply int and then str. Alternatively, you can use lambda x: str(int(x)).
from io import StringIO
import pandas as pd
from toolz import compose

mystr = StringIO('''"birth_date", "name"
"","Dr. Who"
"1625", "Rembrandt"
"1533", "Michel"''')

df = pd.read_csv(mystr,
                 converters={'birth_date': compose(str, int)},
                 engine='python')
print(df.birth_date)
0 NaN
1 1625
2 1533
Name: birth_date, dtype: object
If you need to replace NaN with empty strings, you can post-process with fillna:
print(df.birth_date.fillna(''))
0
1 1625
2 1533
Name: birth_date, dtype: object

Get the same hash value for a Pandas DataFrame each time

My goal is to get a unique hash value for a DataFrame, which I obtain from a .csv file.
The whole point is to get the same hash each time I call hash() on it.
My idea was to create the function
def _get_array_hash(arr):
    arr_hashable = arr.values
    arr_hashable.flags.writeable = False
    hash_ = hash(arr_hashable.data)
    return hash_
which takes the underlying numpy array, sets it to an immutable state, and hashes the buffer.
INLINE UPD.
As of 08.11.2016, this version of the function doesn't work anymore. Instead, you should use
hash(df.values.tobytes())
See the comments on Most efficient property to hash for numpy array.
END OF INLINE UPD.
It works for a regular DataFrame created in memory:
In [12]: data = pd.DataFrame({'A': [0], 'B': [1]})
In [13]: _get_array_hash(data)
Out[13]: -5522125492475424165
In [14]: _get_array_hash(data)
Out[14]: -5522125492475424165
But then I try to apply it to DataFrame obtained from a .csv file:
In [15]: fpath = 'foo/bar.csv'
In [16]: data_from_file = pd.read_csv(fpath)
In [17]: _get_array_hash(data_from_file)
Out[17]: 6997017925422497085
In [18]: _get_array_hash(data_from_file)
Out[18]: -7524466731745902730
Can somebody explain to me how that's possible?
I can create a new DataFrame out of it, like
new_data = pd.DataFrame(data=data_from_file.values,
                        columns=data_from_file.columns,
                        index=data_from_file.index)
and it works again
In [25]: _get_array_hash(new_data)
Out[25]: -3546154109803008241
In [26]: _get_array_hash(new_data)
Out[26]: -3546154109803008241
But my goal is to preserve the same hash value for a dataframe across application launches, in order to retrieve some value from a cache.
As of Pandas 0.20.1, you can use the little-known (and poorly documented) hash_pandas_object (source code), which was recently made public in pandas.util. It returns one hash value for each row of the dataframe (and works on Series etc. too).
import pandas as pd
import numpy as np
np.random.seed(42)
arr = np.random.choice(['foo', 'bar', 42], size=(3,4))
df = pd.DataFrame(arr)
print(df)
# 0 1 2 3
# 0 42 foo 42 42
# 1 foo foo 42 bar
# 2 42 42 42 42
from pandas.util import hash_pandas_object
h = hash_pandas_object(df)
print(h)
# 0 5559921529589760079
# 1 16825627446701693880
# 2 7171023939017372657
# dtype: uint64
You can always do hash_pandas_object(df).sum() if you want an overall hash of all rows.
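If the goal is a value that stays stable across application launches, bear in mind that Python's built-in hash() is salted per process for str/bytes; a minimal sketch (not part of the original answer, the helper name is made up) that reduces the per-row hashes to a reproducible digest:
import hashlib
from pandas.util import hash_pandas_object

def stable_df_digest(df):
    # Per-row uint64 hashes (index included), reduced to one hex digest
    row_hashes = hash_pandas_object(df, index=True)
    return hashlib.sha256(row_hashes.values.tobytes()).hexdigest()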
Joblib provides a hashing function optimized for objects containing numpy arrays (e.g. pandas dataframes).
import joblib
joblib.hash(df)
I had a similar problem: checking whether a dataframe has changed. I solved it by hashing the msgpack serialization string. This seems stable across different reloads of the same data.
import pandas as pd
import hashlib
DATA_FILE = 'data.json'
data1 = pd.read_json(DATA_FILE)
data2 = pd.read_json(DATA_FILE)
assert hashlib.md5(data1.to_msgpack()).hexdigest() == hashlib.md5(data2.to_msgpack()).hexdigest()
assert hashlib.md5(data1.values.tobytes()).hexdigest() != hashlib.md5(data2.values.tobytes()).hexdigest()
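Note that to_msgpack was deprecated and later removed from pandas; a similar serialize-then-hash sketch with a still-available serializer (the helper name is made up, not from the original answer):
import hashlib

def json_digest(df):
    # Serialize the frame deterministically and hash the resulting bytes
    return hashlib.md5(df.to_json().encode('utf-8')).hexdigest()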
This function seems to work fine:
from hashlib import sha256
def hash_df(df):
    s = str(df.columns) + str(df.index) + str(df.values)
    return sha256(s.encode()).hexdigest()
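A quick usage check of that helper (with throwaway data):
import pandas as pd

df = pd.DataFrame({'A': [0], 'B': [1]})
assert hash_df(df) == hash_df(df.copy())  # same content, same digest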

Replace text with numbers using dictionary in pandas

I'm trying to replace months represented as characters (e.g. 'NOV') with their numerical counterparts ('-11-'). I can get the following piece of code to work properly.
df_cohorts['ltouch_datetime'] = df_cohorts['ltouch_datetime'].str.replace('NOV','-11-')
df_cohorts['ltouch_datetime'] = df_cohorts['ltouch_datetime'].str.replace('DEC','-12-')
df_cohorts['ltouch_datetime'] = df_cohorts['ltouch_datetime'].str.replace('JAN','-01-')
However, to avoid redundancy, I'd like to use a dictionary and .replace to replace the character variable for all months.
r_month1 = {'JAN':'-01-','FEB':'-02-','MAR':'-03-','APR':'-04-','MAY':'-05-','JUN':'-06-','JUL':'-07-','AUG':'-08-','SEP':'-09-','OCT':'-10-','NOV':'-11-','DEC':'-12-'}
df_cohorts.replace({'conversion_datetime': r_month1,'ltouch_datetime': r_month1})
When I enter the code above, my output dataset is unchanged. For reference, please see my sample data below.
User_ID ltouch_datetime conversion_datetime
001 11NOV14:13:12:56 11NOV14:16:12:00
002 07NOV14:17:46:14 08NOV14:13:10:00
003 04DEC14:17:46:14 04DEC15:13:12:00
Thanks!
Let me suggest a different approach: You could parse the date strings into a column of pandas TimeStamps like this:
import pandas as pd
df = pd.read_table('data', sep='\s+')
for col in ('ltouch_datetime', 'conversion_datetime'):
    df[col] = pd.to_datetime(df[col], format='%d%b%y:%H:%M:%S')
print(df)
# User_ID ltouch_datetime conversion_datetime
# 0 1 2014-11-11 13:12:56 2014-11-11 16:12:00
# 1 2 2014-11-07 17:46:14 2014-11-08 13:10:00
# 2 3 2014-12-04 17:46:14 2015-12-04 13:12:00
I would stop right here, since representing dates as Timestamps is the ideal form for the data in Pandas.
However, if you need/want date strings with 3-letter months like 'NOV' converted to -11-, then you can convert the Timestamps with strftime and apply:
for col in ('ltouch_datetime', 'conversion_datetime'):
    df[col] = df[col].apply(lambda x: x.strftime('%d-%m-%y:%H:%M:%S'))
print(df)
yields
User_ID ltouch_datetime conversion_datetime
0 1 11-11-14:13:12:56 11-11-14:16:12:00
1 2 07-11-14:17:46:14 08-11-14:13:10:00
2 3 04-12-14:17:46:14 04-12-15:13:12:00
To answer your question literally, in order to use Series.str.replace you need a column with the month string abbreviations all by themselves. You can arrange for that by first calling Series.str.extract. Then you can join the columns back into one using apply:
import pandas as pd
import calendar
month_map = {calendar.month_abbr[m].upper(): '-{:02d}-'.format(m)
             for m in range(1, 13)}

df = pd.read_table('data', sep='\s+')
for col in ('ltouch_datetime', 'conversion_datetime'):
    tmp = df[col].str.extract(r'(.*?)(\D+)(.*)')
    tmp[1] = tmp[1].replace(month_map)
    df[col] = tmp.apply(''.join, axis=1)
print(df)
print(df)
yields
User_ID ltouch_datetime conversion_datetime
0 1 11-11-14:13:12:56 11-11-14:16:12:00
1 2 07-11-14:17:46:14 08-11-14:13:10:00
2 3 04-12-14:17:46:14 04-12-15:13:12:00
Finally, although you haven't asked for this directly, it's good to be aware that if your data is in a file, you can parse the datestring columns into Timestamps directly using
import pandas as pd
import datetime as DT
df = pd.read_table(
    'data', sep='\s+', parse_dates=[1, 2],
    date_parser=lambda x: DT.datetime.strptime(x, '%d%b%y:%H:%M:%S'))
This might be the most convenient method of all (assuming you want TimeStamps).
