Dask categorize() won't work after using .loc - python

I'm having a serious issue using dask (dask version: 1.00, pandas version: 0.23.3). I am trying to load a dask dataframe from a CSV file, filter the results into two separate dataframes, and perform operations on both.
However, after I split the dataframe and try to set the category columns as 'known', they remain 'unknown'. Thus I cannot continue with my operations (which require the category columns to be 'known').
NOTE: I have created a minimal example as suggested, using pandas instead of read_csv().
import pandas as pd
import dask.dataframe as dd

# Specify dtypes
b_dtypes = {
    'symbol': 'category',
    'price': 'float64',
}
i_dtypes = {
    'symbol': 'category',
    'price': 'object'
}

# Specify a function to quickly set dtypes
def to_dtypes(df, dtypes):
    for column, dtype in dtypes.items():
        if column in df.columns:
            df[column] = df.loc[:, column].astype(dtype)
    return df

# Set up our test data
data = [
    ['B', 'IBN', '9.9800'],
    ['B', 'PAY', '21.5000'],
    ['I', 'PAY', 'seventeen'],
    ['I', 'SPY', 'ten']
]

# Create pandas dataframe
pdf = pd.DataFrame(data, columns=['type', 'symbol', 'price'], dtype='object')

# Convert into dask
df = dd.from_pandas(pdf, npartitions=3)
#
## At this point 'df' simulates what I get when I read the mixed-type CSV file via dask
#
# Split the dataframe by the 'type' column
b_df = df.loc[df['type'] == 'B', :]
i_df = df.loc[df['type'] == 'I', :]
# Convert columns into our intended dtypes
b_df = to_dtypes(b_df, b_dtypes)
i_df = to_dtypes(i_df, i_dtypes)
# Let's convert our 'symbol' column to known categories
b_df = b_df.categorize(columns=['symbol'])
i_df['symbol'] = i_df['symbol'].cat.as_known()
# Is our symbol column known now?
print(b_df['symbol'].cat.known, flush=True)
print(i_df['symbol'].cat.known, flush=True)
#
## print() returns 'False' for both, which is driving me up the wall.
## (Please help...)
#
UPDATE: It seems that if I set the 'npartitions' parameter to 1, then print() returns True in both cases. So this appears to be an issue with the partitions containing different categories. However, loading each dataframe into a single partition is not feasible, so is there a way I can tell dask to do some sort of re-sorting to make the categories consistent across partitions?

The answer to your problem is basically contained in the docs. I'm referring to the part of the code commented with # categorize requires computation, and results in known categoricals. I'll expand here because it seems to me you're misusing loc.
import pandas as pd
import dask.dataframe as dd

# Set up our test data
data = [['B', 'IBN', '9.9800'],
        ['B', 'PAY', '21.5000'],
        ['I', 'PAY', 'seventeen'],
        ['I', 'SPY', 'ten']]

# Create pandas dataframe
pdf = pd.DataFrame(data, columns=['type', 'symbol', 'price'], dtype='object')

# Convert into dask
ddf = dd.from_pandas(pdf, npartitions=3)

# Split the dataframe by the 'type' column
# reset_index is not necessary
b_df = ddf[ddf["type"] == "B"].reset_index(drop=True)
i_df = ddf[ddf["type"] == "I"].reset_index(drop=True)

# Convert columns into our intended dtypes
b_df = b_df.categorize(columns=['symbol'])
b_df["price"] = b_df["price"].astype('float64')
i_df = i_df.categorize(columns=['symbol'])

# Is our symbol column known now? YES
print(b_df['symbol'].cat.known, flush=True)
print(i_df['symbol'].cat.known, flush=True)
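As a quick sanity check (a minimal sketch, assuming the code above has already run): once a categorical is 'known', dask keeps its categories in the metadata, so they can be inspected without a compute(), and category-dependent operations work consistently across partitions.
# known categoricals expose their categories without triggering a compute()
print(b_df['symbol'].cat.categories)
print(i_df['symbol'].cat.categories)
# operations that require known categories, e.g. a groupby-aggregate, now run:
print(b_df.groupby('symbol')['price'].sum().compute())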

Related

Group by time intervals and additional attribute

I have this data:
import pandas as pd
data = {
    'timestamp': ['2022-11-03 00:00:06', '2022-11-03 00:00:33', '2022-11-03 00:00:35',
                  '2022-11-03 00:00:46', '2022-11-03 00:01:21', '2022-11-03 00:01:30'],
    'from': ['A', 'A', 'A', 'A', 'B', 'C'],
    'to': ['B', 'B', 'B', 'C', 'C', 'B'],
    'type': ['Car', 'Car', 'Van', 'Car', 'HGV', 'Van']
}
df = pd.DataFrame(data)
I want to create two sets of CSVs:
One CSV for each type of vehicle (8 in total) where the rows are grouped/aggregated by timestamp (in 15-minute intervals throughout the day) and by the "FROM" column; there will be no "TO" column here.
One CSV for each type of vehicle (8 in total) where the rows are grouped/aggregated by timestamp (in 15-minute intervals throughout the day), by the "FROM" column and by the "TO" column.
The difference between the two sets is that one will count all FROM items and the other will group them and count them by pairs of FROM and TO.
The output will be an aggregated sum of vehicles of a given type for 15 minute intervals summed up by FROM column and also a combination of FROM and TO column.
1st output can look like this for each vehicle type:
2nd output:
I tried using Pandas groupby() and resample() but, due to my limited knowledge, without success. I can do this in Excel but very inefficiently. I want to learn Python and become more efficient, therefore I would like to code it in Pandas.
I tried df.groupby(['FROM', 'TO']).count() but I lack the knowledge to use it for what I need. I keep either getting an error when I do something I should not, or the output is not what I need.
I tried df.groupby(pd.Grouper(freq='15Min')).count() but it seems I perhaps have an incorrect data type.
And I don't know if this is applicable.
If I understand you correctly, one approach could be as follows:
Data
import pandas as pd
# IIUC, you want e.g. '2022-11-03 00:00:06' to be in the `00:15` bucket, we need `to_offset`
from pandas.tseries.frequencies import to_offset
# adjusting last 2 timestamps to get a diff interval group
data = {'timestamp': ['2022-11-03 00:00:06', '2022-11-03 00:00:33',
                      '2022-11-03 00:00:35', '2022-11-03 00:00:46',
                      '2022-11-03 00:20:21', '2022-11-03 00:21:30'],
        'from': ['A', 'A', 'A', 'A', 'B', 'C'],
        'to': ['B', 'B', 'B', 'C', 'C', 'B'],
        'type': ['Car', 'Car', 'Van', 'Car', 'HGV', 'Van']}
df = pd.DataFrame(data)
print(df)

             timestamp from to type
0  2022-11-03 00:00:06    A  B  Car
1  2022-11-03 00:00:33    A  B  Car
2  2022-11-03 00:00:35    A  B  Van
3  2022-11-03 00:00:46    A  C  Car
4  2022-11-03 00:20:21    B  C  HGV
5  2022-11-03 00:21:30    C  B  Van
# e.g. for FROM we want: `A`, `4` (COUNT), `00:15` (TIME-END)
# e.g. for FROM-TO we want: `A-B`, 3 (COUNT), `00:15` (TIME-END)
# `A-C`, 1 (COUNT), `00:15` (TIME-END)
Code
# convert time strings to datetime and set the column as index
df['timestamp'] = pd.to_datetime(df['timestamp'])
df.set_index('timestamp', inplace=True)

# add a `15T` (== 15 minutes) offset to the datetime values
df.index = df.index + to_offset('15T')

# create a `dict` for conversion of the column names
cols = {'timestamp': 'TIME-END', 'from': 'FROM', 'to': 'TO'}

# we're doing basically the same for both outputs, so let's use a for loop on a nested list
nested_list = [['from'], ['from', 'to']]
for item in nested_list:
    # groupby `item` (i.e. `['from']` and `['from', 'to']`)
    # use `.agg` to create a named output (`COUNT`), applied to `item[0]`, so 2x on: `from`,
    # and get the `count`. Finally, reset the index
    out = df.groupby(item).resample('15T').agg(COUNT=(item[0], 'count')).reset_index()
    # rename the columns using our `cols` dict
    out = out.rename(columns=cols)
    # convert timestamps like `2022-11-03 00:15:00` to `00:15:00`
    out['TIME-END'] = out['TIME-END'].dt.strftime('%H:%M:%S')
    # rearrange the order of the columns; for the second `item` we need to include `to` (now: `TO`)
    if 'TO' in out.columns:
        out = out.loc[:, ['FROM', 'TO', 'COUNT', 'TIME-END']]
    else:
        out = out.loc[:, ['FROM', 'COUNT', 'TIME-END']]
    # write the output to a csv file; e.g. use an `f-string` to customize the file name;
    # `index=False` avoids writing away the index
    out.to_csv(f'output_{"_".join(item)}.csv', index=False)  # i.e. 'output_from', 'output_from_to'
Output (loaded in excel)
Relevant documentation:
pd.to_datetime, df.set_index, .to_offset
df.groupby, .resample
df.rename
.dt.strftime
df.to_csv
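Since the question also tried pd.Grouper: the likely reason that attempt failed is that 'timestamp' was still a string. Once it is a real datetime, the same 15-minute bucketing can be written with pd.Grouper instead of resample. A minimal sketch, assuming the conversion and offset from the code above have already been applied:
# pd.Grouper needs a datetime key; reset_index() turns the index back into a column
counts = (df.reset_index()
            .groupby([pd.Grouper(key='timestamp', freq='15T'), 'from'])
            .size()
            .reset_index(name='COUNT'))
print(counts)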

Insert rows in Python dataframe with conditions

I have a large data file as shown below.
Edited to include an updated example:
I wanted to add two new columns (E and F) next to column D and move the suite # (when applicable) and the City/State data in cells D3 and D4 to E2 and F2, respectively. The challenge is that not every entry has a suite number. I would need to insert a row first for those entries that don't have a suite number; only for them, not for those that already have the suite information.
I know how to do loops, but I am having trouble defining the conditions. One way is to count the length of the string. How should I get started? I'd much appreciate your help!
This is how I would do it. I don't recommend looping when using pandas; it has so many tools that looping is often not needed. Some caution on this: your spreadsheet has NaN, and I think that is actually the numpy np.nan equivalent. You also have blanks, which I am assuming are equivalent to "".
import pandas as pd
import numpy as np

# dictionary of your data
companies = {
    'Comp ID': ['C1', '', np.nan, 'C2', '', np.nan, 'C3', np.nan],
    'Address': ['10 foo', 'Suite A', 'foo city', '11 spam', 'STE 100', 'spam town', '12 ham', 'Myhammy'],
    'phone': ['888-321-4567', '', np.nan, '888-321-4567', '', np.nan, '888-321-4567', np.nan],
    'Type': ['W_sale', '', np.nan, 'W_sale', '', np.nan, 'W_sale', np.nan],
}

# make the frames needed.
df = pd.DataFrame(companies)
df1 = pd.DataFrame()  # blank frame for suite and town columns

# Edit here to TEST the data types
for r in range(0, 5):
    v = df['Comp ID'].values[r]
    print(f'this "{v}" is a ', type(v))
# So this will tell us the data types so we can construct our where(). Back to the prior answer....

# Need a where clause; it is similar to an if() statement in excel
df1['Suite'] = np.where(df['Comp ID'] == '', df['Address'], np.nan)
df1['City/State'] = np.where(df['Comp ID'].isna(), df['Address'], np.nan)

# copy values to the rows above
df1 = df1[['Suite', 'City/State']].backfill()

# join the frames together on the index
df = df.join(df1)
df.drop_duplicates(subset=['City/State'], keep='first', inplace=True)

# set the column order to what you want
df = df[['Comp ID', 'Type', 'Address', 'Suite', 'City/State', 'phone']]
output
Comp ID  Type    Address  Suite    City/State  phone
C1       W_sale  10 foo   Suite A  foo city    888-321-4567
C2       W_sale  11 spam  STE 100  spam town   888-321-4567
C3       W_sale  12 ham            Myhammy     888-321-4567
Edit: the numpy where statement:
numpy is brought in by the line import numpy as np at the top. We are creating a calculated column based on the 'Comp ID' column. numpy does this without loops. Think of where like an excel IF() function:
df1[return_column] = np.where(df[test_column] == condition, value_if_true, value_if_false)
The pandas backfill:
Sometimes you have a value in a cell below and you want to duplicate it into the blank cell above it, so you backfill: df1 = df1[['Suite','City/State']].backfill().
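To make the two building blocks concrete, here is a tiny self-contained sketch (illustrative data only, not the full solution above):
import numpy as np
import pandas as pd

tiny = pd.DataFrame({'Comp ID': ['C1', '', np.nan],
                     'Address': ['10 foo', 'Suite A', 'foo city']})
# np.where picks the Address where the condition holds, NaN otherwise
suite = np.where(tiny['Comp ID'] == '', tiny['Address'], np.nan)
city = np.where(tiny['Comp ID'].isna(), tiny['Address'], np.nan)
# backfill copies each value up into the blank cells above it
helper = pd.DataFrame({'Suite': suite, 'City/State': city}).backfill()
print(helper)
# expected roughly:
#      Suite City/State
# 0  Suite A   foo city
# 1  Suite A   foo city
# 2      NaN   foo city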

Pandas Dataframe from list nested in json

I have a request that gets me some data that looks like this:
[{'__rowType': 'META',
  '__type': 'units',
  'data': [{'name': 'units.unit', 'type': 'STRING'},
           {'name': 'units.classification', 'type': 'STRING'}]},
 {'__rowType': 'DATA', '__type': 'units', 'data': ['A', 'Energie']},
 {'__rowType': 'DATA', '__type': 'units', 'data': ['bar', ' ']},
 {'__rowType': 'DATA', '__type': 'units', 'data': ['CCM', 'Volumen']},
 {'__rowType': 'DATA', '__type': 'units', 'data': ['CDM', 'Volumen']}]
and would like to construct a (Pandas) DataFrame that looks like this:
Things like pd.DataFrame(pd.json_normalize(test)['data']) are close but still throw the whole list into the column instead of making separate columns. record_path sounded right but I can't get it to work correctly either.
Any help?
It's difficult to know how the example generalizes, but for this particular case you could use:
pd.DataFrame([d['data'] for d in test
              if d.get('__rowType', None) == 'DATA' and 'data' in d],
             columns=['unit', 'classification']
             )
NB. assuming test is the input list
output:
  unit classification
0    A       Energie
1  bar
2  CCM       Volumen
3  CDM       Volumen
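Since the question already reached for pd.json_normalize, a closely related sketch (again assuming test is the input list): json_normalize keeps each 'data' list intact, so the lists still have to be expanded into columns afterwards.
import pandas as pd

norm = pd.json_normalize(test)
# keep only the DATA rows, then expand each list into its own columns
data_rows = norm.loc[norm['__rowType'] == 'DATA', 'data']
out = pd.DataFrame(data_rows.tolist(), columns=['unit', 'classification'])
print(out)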
Instead of just giving you the code, I'll first explain how you can do this in detail and then show you the exact steps to follow and the final code. This way you understand everything for any future situation.
When you want to create a pandas dataframe with two columns you can do this by creating a dictionary and passing it to DataFrame class:
my_data = {'col1': [1, 2], 'col2': [3, 4]}
df = pd.DataFrame(data=my_data)
This will result in this dataframe:
So if you want to have the dataframe you specified in your question the my_data dictionary should be like this:
my_data = {
    'unit': ['A', 'bar', 'CCM', 'CDM'],
    'classification': ['Energie', '', 'Volumen', 'Volumen'],
}
df = pd.DataFrame(data=my_data)
df.index = np.arange(1, len(df) + 1)
df
(You can see the df.index = ... part. This is because the index of the desired dataframe starts at 1 in your question.)
So you just have to extract these values from the data you provided and convert them into the exact dictionary mentioned above (the my_data dictionary).
To do so you can do this:
# This will get the data values like 'bar', 'CCM' and etc from your initial data
values = [x['data'] for x in d if x['__rowType']=='DATA']
# This gets the columns names from meta data
meta = list(filter(lambda x: x['__rowType']=='META', d))[0]
columns = [x['name'].split('.')[-1] for x in meta['data']]
# This line creates the exact dictionary we need to send to DataFrame class.
my_data = {column:[v[i] for v in values] for i, column in enumerate(columns)}
So the whole code would be this:
import numpy as np
import pandas as pd

d = YOUR_DATA
# This will get the data values like 'bar', 'CCM', etc.
values = [x['data'] for x in d if x['__rowType'] == 'DATA']
# This gets the column names from the META row
meta = list(filter(lambda x: x['__rowType'] == 'META', d))[0]
columns = [x['name'].split('.')[-1] for x in meta['data']]
# This line creates the exact dictionary we need to send to the DataFrame class.
my_data = {column: [v[i] for v in values] for i, column in enumerate(columns)}
df = pd.DataFrame(data=my_data)
df.index = np.arange(1, len(df) + 1)
df  # or print(df)
Note: Of course you can do all of this in one complex line of code, but to avoid confusion I decided to do it in a couple of lines.
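For reference, the "one complex line" that the note alludes to could look something like this (a sketch, assuming d is the input list as above):
df = pd.DataFrame(
    [x['data'] for x in d if x['__rowType'] == 'DATA'],
    columns=[x['name'].split('.')[-1]
             for x in next(r for r in d if r['__rowType'] == 'META')['data']]
)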

Compare df's including detailed insight in data

I'm working on a python project with:
df_testR with columns={'Name', 'City','Licence', 'Amount'}
df_testF with columns={'Name', 'City','Licence', 'Amount'}
I want to compare both df's. The result should be a df where I see the Name, City, Licence and the Amount. Normally, df_testR and df_testF should be exactly the same.
In case it is not the same, I want to see the difference in Amount_R vs Amount_F.
I referred to: Diff between two dataframes in pandas
But I receive a table with TRUE and FALSE only:
Name  City  Licence  Amount
True  True  True     False
But I'd like to get a table that lists ONLY the lines where differences occur, and that shows the differences between the data, such as:
Name  City  Licence  Amount_R  Amount_F
Paul  NY    YES      200       500
Here, both tables contain PAUL, NY and Licence = Yes, but Table R contains 200 as Amount and table F contains 500 as amount. I want to receive a table from my analysis that captures only the lines where such differences occur.
Could someone help?
import copy
import pandas as pd
data1 = {'Name': ['A', 'B', 'C'], 'City': ['SF', 'LA', 'NY'], 'Licence': ['YES', 'NO', 'NO'], 'Amount': [100, 200, 300]}
data2 = copy.deepcopy(data1)
data2.update({'Amount': [500, 200, 300]})
df1 = pd.DataFrame(data1)
df2 = pd.DataFrame(data2)
df2.drop(1, inplace=True)
First find the missing rows and print them:
matching = df1.isin(df2)
meta_data_columns = ['Name', 'City', 'Licence']
metadata_match = matching[meta_data_columns]
metadata_match['check'] = metadata_match.apply(all, 1, raw=True)
missing_rows = list(metadata_match.index[~metadata_match['check']])
if missing_rows:
    print('Some rows are missing from df2:')
    print(df1.iloc[missing_rows, :])
Then drop these rows and merge:
df3 = pd.merge(df2, df1.drop(missing_rows), on=meta_data_columns)
Now remove the rows that have the same amount:
df_different_amounts = df3.loc[df3['Amount_x'] != df3['Amount_y'], :]
I assumed the DFs are sorted.
If you're dealing with very large DFs it might be better to first filter the DFs to make the merge faster.
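If the key columns uniquely identify each row, a compact alternative is to merge with explicit suffixes and keep only the mismatching rows. A minimal sketch with made-up data (column names follow the question):
import pandas as pd

df_testR = pd.DataFrame({'Name': ['Paul', 'Anna'], 'City': ['NY', 'LA'],
                         'Licence': ['YES', 'NO'], 'Amount': [200, 300]})
df_testF = pd.DataFrame({'Name': ['Paul', 'Anna'], 'City': ['NY', 'LA'],
                         'Licence': ['YES', 'NO'], 'Amount': [500, 300]})

# merge on the identifying columns; suffixes give us Amount_R and Amount_F
merged = df_testR.merge(df_testF, on=['Name', 'City', 'Licence'], suffixes=('_R', '_F'))
# keep only the rows where the amounts differ
diff = merged.loc[merged['Amount_R'] != merged['Amount_F']]
print(diff)
#    Name City Licence  Amount_R  Amount_F
# 0  Paul   NY     YES       200       500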

How can I declare a Column as a categorical feature in a DataFrame for use in ml

How can I declare that a given Column in my DataFrame contains categorical information?
I have a Spark SQL DataFrame which I loaded from a database. Many of the columns in this DataFrame have categorical information, but they are encoded as Longs (for privacy).
I want to be able to tell spark-ml that even though this column is Numerical, the information is actually Categorical. The indices of the categories may have a few holes, and that is acceptable. (Ex. a column may have the values [1, 0, 0, 4])
I am aware that the StringIndexer exists, but I would prefer to avoid the hassle of encoding and decoding, especially because I have many columns that have this behavior.
I would be looking for something that looks like the following
train = load_from_database()
categorical_cols = ["CategoricalColOfLongs1",
                    "CategoricalColOfLongs2"]
numeric_cols = ["NumericColOfLongs1"]

## This is what I am looking for
## this step detects the min and max value of both columns
## and adds metadata to indicate this as a categorical column
## with (1 + max - min) categories
categorizer = ColumnCategorizer(columns=categorical_cols,
                                autoDetectMinMax=True)
##

vectorizer = VectorAssembler(inputCols=categorical_cols + numeric_cols,
                             outputCol="features")
classifier = DecisionTreeClassifier()
pipeline = Pipeline(stages=[categorizer, vectorizer, classifier])
model = pipeline.fit(train)
I would prefer to avoid the hassle of encoding and decoding,
You cannot really avoid this completely. The required metadata for a categorical variable is actually a mapping between values and indices. Still, there is no need to do it manually or to create a custom transformer. Let's assume you have a data frame like this:
import numpy as np
import pandas as pd

df = sqlContext.createDataFrame(pd.DataFrame({
    "x1": np.random.random(1000),
    "x2": np.random.choice(3, 1000),
    "x4": np.random.choice(5, 1000)
}))
All you need is an assembler and indexer:
from pyspark.ml.feature import VectorAssembler, VectorIndexer
from pyspark.ml import Pipeline

pipeline = Pipeline(stages=[
    VectorAssembler(inputCols=df.columns, outputCol="features_raw"),
    VectorIndexer(
        inputCol="features_raw", outputCol="features", maxCategories=10)])

transformed = pipeline.fit(df).transform(df)
transformed.schema.fields[-1].metadata

## {'ml_attr': {'attrs': {'nominal': [{'idx': 1,
##      'name': 'x2',
##      'ord': False,
##      'vals': ['0.0', '1.0', '2.0']},
##     {'idx': 2,
##      'name': 'x4',
##      'ord': False,
##      'vals': ['0.0', '1.0', '2.0', '3.0', '4.0']}],
##    'numeric': [{'idx': 0, 'name': 'x1'}]},
##   'num_attrs': 3}}
This example also shows the type of information you need to provide to mark a given element of the vector as a categorical variable:
{
    'idx': 2,       # index (position in vector)
    'name': 'x4',   # name
    'ord': False,   # is ordinal?
    # mapping between value and label
    'vals': ['0.0', '1.0', '2.0', '3.0', '4.0']
}
So if you want to build this from scratch, all you have to do is provide the correct schema:
from pyspark.sql.types import *
from pyspark.mllib.linalg import VectorUDT

# Let's assume we have only a vector
raw = transformed.select("features_raw")
# Dictionary equivalent to transformed.schema.fields[-1].metadata shown above
meta = ...
schema = StructType([StructField("features", VectorUDT(), metadata=meta)])
sqlContext.createDataFrame(raw.rdd, schema)
But it is quite inefficient due to the required serialization and deserialization.
Since Spark 2.2 you can also use the metadata argument:
from pyspark.sql.functions import col

df.withColumn("features", col("features").alias("features", metadata=meta))
See also Attach metadata to vector column in Spark
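To make the metadata route more concrete, here is a minimal sketch for a single Long-encoded column (the column name and category values are illustrative, and the ml_attr layout follows the nominal-attribute format shown above):
from pyspark.sql.functions import col

# mark the numeric column 'x4' as nominal with 5 categories
meta = {"ml_attr": {"type": "nominal",
                    "name": "x4",
                    "vals": ["0.0", "1.0", "2.0", "3.0", "4.0"]}}
df2 = df.withColumn("x4", col("x4").cast("double").alias("x4", metadata=meta))
print(df2.schema["x4"].metadata)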
Hey zero323, I used the same technique to look at the metadata and I coded up this Transformer:
def _transform(self, data):
    maxValues = self.getOrDefault(self.maxValues)
    categoricalCols = self.getOrDefault(self.categoricalCols)

    new_schema = types.StructType(data.schema.fields)
    new_data = data
    for (col, maxVal) in zip(categoricalCols, maxValues):
        # I have not decided if I should make a new column or
        # overwrite the original column
        new_col_name = col + "_categorical"
        new_data = new_data.withColumn(new_col_name,
                                       data[col].astype(types.DoubleType()))
        # metadata for a categorical column
        meta = {u'ml_attr': {u'vals': [unicode(i) for i in range(maxVal + 1)],
                             u'type': u'nominal',
                             u'name': new_col_name}}
        new_schema.add(new_col_name, types.DoubleType(), True, meta)
    return data.sql_ctx.createDataFrame(new_data.rdd, new_schema)
