Apply json.loads for a column of dataframe with dask - python

I have a dataframe fulldb_accrep_united that looks like this:
SparkID ... Period
0 913955 ... {"#PeriodName": "2000", "#DateBegin": "2000-01...
1 913955 ... {"#PeriodName": "1999", "#DateBegin": "1999-01...
2 16768 ... {"#PeriodName": "2007", "#DateBegin": "2007-01...
3 16768 ... {"#PeriodName": "2006", "#DateBegin": "2006-01...
4 16768 ... {"#PeriodName": "2005", "#DateBegin": "2005-01...
I need to convert the Period column, which is currently a column of strings, into a column of JSON values. Usually I do it with df.apply(lambda x: json.loads(x)), but this dataframe is too large to process as a whole. I want to use dask, but I seem to be missing something important. I think I don't understand how to use apply in dask, and I can't figure out the solution.
The code
This is how I would do it in pandas with the whole dataframe in memory:
#%% read df
os.chdir('/opt/data/.../download finance/output')
fulldb_accrep_united = pd.read_csv('fulldb_accrep_first_download_raw_quotes_corrected.csv', index_col = 0, encoding = 'utf-8')
os.chdir('..')
#%% Deleting some freaky symbols from column
condition = fulldb_accrep_united['Period'].str.contains('\\xa0', na = False, regex = False)
fulldb_accrep_united.loc[condition.values, 'Period'] = fulldb_accrep_united.loc[condition.values, 'Period'].str.replace('\\xa0', ' ', regex = False).values
#%% Convert to json
fulldb_accrep_united.loc[fulldb_accrep_united['Period'].notnull(), 'Period'] = fulldb_accrep_united['Period'].dropna().apply(lambda x: json.loads(x))
This is the code where I try to use dask:
#%% load data with dask
os.chdir('/opt/data/.../download finance/output')
fulldb_accrep_united = dd.read_csv('fulldb_accrep_first_download_raw_quotes_corrected.csv', encoding = 'utf-8', blocksize = 16 * 1024 * 1024) #16Mb chunks
os.chdir('..')
#%% setup calculation graph. No work is done here.
def transform_to_json(df):
    condition = df['Period'].str.contains('\\xa0', na = False, regex = False)
    df['Period'] = df['Period'].mask(condition.values, df['Period'][condition.values].str.replace('\\xa0', ' ', regex = False).values)
    condition2 = df['Period'].notnull()
    df['Period'] = df['Period'].mask(condition2.values, df['Period'].dropna().apply(lambda x: json.loads(x)).values)

result = transform_to_json(fulldb_accrep_united)
The last cell here gives error:
NotImplementedError: Series getitem in only supported for other series objects with matching partition structure
What am I doing wrong? I tried to find similar topics for almost 5 hours, but I think I am missing something important, because I am new to the topic.

Your question was long enough that I didn't read through all of it. My apologies. See https://stackoverflow.com/help/minimal-reproducible-example
However, based on the title, it may be that you want to apply the json.loads function across every element in a dataframe's column:
df["column-name"] = df["column-name"].apply(json.loads)

Related

A progress bar for my function (Python, pandas)

I'm trying to read big csv files and also effectively work on other stuff at the same time. That is why my solution to this problem is to create a progress bar (something that shows how far through the read I've come and gives me a sense of how much time is left before the read is complete). However, I have tried using tqdm as well as home-made while loops, and to my misfortune I have not found a solution to this problem. I have tried using this thread: How to see the progress bar of read_csv
without any luck. Maybe I can apply tqdm in a different way? Are there any other solutions?
Here's the important part of the code (the one I want to add a progress bar to):
def read_from_csv(filepath: str,
                  sep: str = ",",
                  header_line: int = 43,
                  skip_rows: int = 48) -> pd.DataFrame:
    """Reads a csv file at filepath containing the vehicle trip data and
    performs a number of formatting operations
    """
    # The first call of read_csv is used to get the column names, which allows
    # the typing to take place at the same time as the second read, which is
    # faster than forcing type afterwards
    df_names: pd.Index[str] = pd.read_csv(
        filepath,
        sep = sep,
        header = header_line,
        skip_blank_lines = False,
        skipinitialspace = True,
        index_col = False,
        engine = 'c',
        nrows = 0,
        encoding = 'iso-8859-1'
    ).columns
    # The "Time" and "Time_abs" columns have some inconsistent
    # "Storage group code" preceding the actual column name, so their
    # full column names are stored so they can be renamed later. Also, we want
    # to interpret "Time_abs" as a string, while the rest are floats. This is
    # stored in a dict to use in the next call to read_csv
    time_col = ""
    time_abs_col = ""
    names_dict = {}
    for name in df_names:
        if ": Time_abs" in name:
            names_dict[name] = 'str'
            time_abs_col = name
        elif ": Time" in name:
            time_col = name
        else:
            names_dict[name] = 'float'
    # A list of values that we want pandas to interpret as having no value.
    # "NOVALUE" is the only one of these that's actually used in the files,
    # the rest are copy-pasted defaults.
    na_vals = ['', '#N/A N/A', '#NA', '-1.#IND', '-1.#QNAN', '-NaN', '-nan',
               '1.#IND', '1.#QNAN', '<NA>', 'N/A', 'NA', 'NULL', 'NaN', 'n/a',
               'nan', 'null', 'NOVALUE']
    # The whole file is parsed and put in a dataframe
    df: pd.DataFrame = pd.read_csv(filepath,
                                   sep = sep,
                                   skiprows = skip_rows,
                                   header = 0,
                                   names = df_names,
                                   skip_blank_lines = False,
                                   skipinitialspace = True,
                                   index_col = False,
                                   engine = 'c',
                                   na_values = na_vals,
                                   dtype = names_dict,
                                   encoding = 'iso-8859-1'
                                   )
    # Renames the "Time" and "Time_abs" columns so they don't include the
    # storage group part
    df.rename(columns = {time_col: "Time", time_abs_col: "Time_abs"},
              inplace = True)
    # Second retyping of this column (here from string to datetime).
    # Very rarely, the Time_abs column in the csv data only has the time and
    # not the date, in which case this line throws an error. We manage this by
    # simply letting it stay as a string
    try:
        df[defs.time_abs] = pd.to_datetime(df[defs.time_abs])
    except:
        pass
    # Every row ends with an extra delimiter which python interprets as another
    # column, but it's empty so we remove it. This is not really necessary, but
    # is done to reduce confusion when debugging
    df.drop(df.columns[-1], axis=1, inplace=True)
    # Adding extra columns to the dataframe used later
    df[defs.lowest_gear] = np.nan
    df[defs.lowest_speed] = np.nan
    for i in list(defs.second_trailer_axles_dict.values()):
        df[i] = np.nan
    return df
It's the reading of the csv that takes a lot of time, so that's the point of interest for adding a progress bar.
Thank you in advance!
You can easily do this with Dask. For example:
import dask.dataframe as dd
from dask.diagnostics import ProgressBar
ddf = dd.read_csv(path, blocksize=1e+6)
with ProgressBar():
    df = ddf.compute()
[########################################] | 100% Completed | 37.0s
And you will see the progress of the file being read.
The blocksize parameter controls the size of the blocks your file is read in; by changing it you can achieve good performance. In addition, Dask uses several threads for reading by default, which speeds up the reading itself.
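If you don't actually need the whole file in memory, the same ProgressBar can wrap any lazy Dask computation; a small sketch (the file name and the 64 MB blocksize are just placeholders):
import dask.dataframe as dd
from dask.diagnostics import ProgressBar

ddf = dd.read_csv("large.csv", blocksize=64 * 1024 * 1024)

with ProgressBar():
    # Any triggered computation reads the file chunk by chunk and reports
    # progress; here we only count the rows instead of materializing them.
    row_count = len(ddf)
print(row_count)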
You can use tqdm.
Somewhere in your function:
def read_from_csv(filepath: str,
                  sep: str = ",",
                  header_line: int = 43,
                  skip_rows: int = 48,
                  chunksize: int = 10000) -> pd.DataFrame:
    # Count the total lines of the file
    # Overhead: 3.73s for 10,000,000 lines / 4.2G on a SSD
    length = sum(1 for row in open('large.csv', 'r'))
    data = []
    with tqdm(total=1 + (length // chunksize)) as pbar:
        # Replace your 2nd pd.read_csv by this:
        for chunk in pd.read_csv('large.csv', ..., chunksize=chunksize):
            data.append(chunk)
            pbar.update(1)  # one tick per chunk, matching the total above
    df = pd.concat(data)
Since there was a question in the comments about a progress bar for some pandas dataframe methods, I will note a solution for such cases. The parallelbar library lets you track progress for the popular Pool methods of the multiprocessing module (map, imap and imap_unordered). It is easy to adapt it for parallel work with pandas dataframes (and to track progress) as follows:
# pip install parallelbar
from parallelbar import progress_map
import pandas as pd
import numpy as np
from multiprocessing import cpu_count
def parallelize_dataframe(df, func, split_size=cpu_count() * 4, **kwargs):
    df_split = np.array_split(df, split_size)
    result_df = pd.concat(progress_map(func, df_split, **kwargs),
                          ignore_index=True)
    return result_df
Here df is your dataframe; func is the function to be applied to the dataframe; split_size is how many parts df should be split into for parallelization (usually the default value is a good choice); and **kwargs are optional keyword arguments for the progress_map function (see the documentation).
For example:
def foo(df):
    df[col] = pd.to_datetime(df[col])
    return df

if __name__ == '__main__':
    new_df = parallelize_dataframe(df, foo)
Not only will you see the progress of execution, but the pd.to_datetime call will also be parallelized, which will significantly speed up your work.

Can this pandas workflow be converted to dask?

Please be nice - I'm not a proper programmer, I'm a scientist and I've read as many docs on this as I can find (they're a bit sparse).
I'm trying to convert this pandas code into dask because my input file is ~0.5 TB gzipped and it loads too slowly in native pandas. I have a 3 TB machine, btw.
This is an example of what I'm doing with pandas:
df = pd.DataFrame([['chr1',33329,17,'''33)'6'4?1&AB=?+..''','''X%&=E&!%,0("&"Y&!'''],
                   ['chr1',33330,15,'''6+'/7=1#><C1*'*''','''X%=E!%,("&"Y&&!'''],
                   ['chr1',33331,13,'''2*3A#/9#CC3--''','''X%E!%,("&"Y&!'''],
                   ['chr1',33332,1,'''4**(,:3)+7-#<(0-''','''X%&E&!%,0("&"Y&!'''],
                   ['chr1',33333,2,'''66(/C=*42A:.&*''','''X%=&!%0("&"&&!''']],
                  columns = ['chrom','pos','depth','phred','map'])

df.loc[:,'phred'] = [(sum(map(ord,i))-len(i)*33)/len(i) for i in df.loc[:,"phred"]]
df.loc[:,"map"] = [(sum(map(ord,i)))/len(i) for i in df.loc[:,"map"]]
df = df.astype({'phred': 'int32', 'map': 'int32'})
df.query('(depth < 10) | (phred < 7) | (map < 10)', inplace=True)

for chrom, df_tmp in df.groupby('chrom'):
    df_end = df_tmp[~((df_tmp.pos.shift(0) == df_tmp.pos.shift(-1)-1))]
    df_start = df_tmp[~((df_tmp.pos.shift(0) == df_tmp.pos.shift(+1)+1))]
    for start, end in zip(df_start.pos, df_end.pos):
        print(start, end)
Gives
33332 33333
This works (to find regions of a cancer genome with no data) and it's optimised as much as I know how.
I load the real thing like:
df = pd.read_csv(
    '/Users/liamm/Downloads/test_head33333.tsv.gz',
    sep='\t',
    header=None,
    index_col=None,
    usecols=[0,1,3,5,6],
    names = ['chrom','pos','depth','phred','map']
)
and I can do the same with Dask (way faster!):
df = dd.read_csv(
    '/Users/liamm/Downloads/test_head33333.tsv.gz',
    sep='\t',
    header=None,
    usecols=[0,1,3,5,6],
    compression='gzip',
    blocksize=None,
    names = ['chrom','pos','depth','phred','map']
)
but I'm stuck here:
ff=[(sum(map(ord,i))-len(i)*33)/len(i) for i in df.loc[:,"phred"]]
df['phred'] = ff
Error: Column assignment doesn't support type list
Question: is this sort of thing possible? If so, are there good tutorials somewhere? I need to convert the whole block of pandas code above.
Thanks in advance!
You created list comprehensions to transform 'phred' and 'map'; I converted these list comprehensions to functions, and wrapped the functions in np.vectorize().
def func_p(p):
    return (sum(map(ord, p)) - len(p) * 33) / len(p)

def func_m(m):
    return (sum(map(ord, m))) / len(m)

vec_func_p = np.vectorize(func_p)
vec_func_m = np.vectorize(func_m)
np.vectorize() does not make code faster, but it does let you write a function with scalar inputs and outputs, and convert it to a function that takes array inputs and outputs.
The benefit is that we can now pass pandas Series to these functions (I also added the type conversion to this step):
df.loc[:, 'phred'] = vec_func_p( df.loc[:, 'phred']).astype(np.int32)
df.loc[:, 'map'] = vec_func_m( df.loc[:, 'map']).astype(np.int32)
Replacing the list comprehensions with these new functions gives the same results as your version (33332 33333).
@rpanai noted that you could eliminate the for loops. The following example uses groupby() (and a couple of helper columns) to find the start and end position for each contiguous sequence of positions.
Using only pandas built-in functions should be compatible with Dask (and fast).
First, create demo data frame with multiple chromosomes and multiple contiguous blocks of positions:
data1 = {
    'chrom' : 'chrom_1',
    'pos' : [1000, 1001, 1002,
             2000, 2001, 2002, 2003]}
data2 = {
    'chrom' : 'chrom_2',
    'pos' : [30000, 30001, 30002, 30003, 30004,
             40000, 40001, 40002, 40003, 40004, 40005]}

df = pd.DataFrame(data1).append( pd.DataFrame(data2) )
Second, create two helper columns:
rank is a sequential counter for each group;
key is constant for positions in a contiguous 'run' of positions.
df['rank'] = df.groupby('chrom')['pos'].rank(method='first')
df['key'] = df['pos'] - df['rank']
Third, group by chrom and key to create a groupby object for each contiguous block of positions, then use min and max to find start and end value for the positions.
result = (df.groupby(['chrom', 'key'])['pos']
          .agg(['min', 'max'])
          .droplevel('key')
          .rename(columns={'min': 'start', 'max': 'end'})
          )
print(result)
           start    end
chrom
chrom_1     1000   1002
chrom_1     2000   2003
chrom_2    30000  30004
chrom_2    40000  40005
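Since the original question asks about Dask: all of the steps above are built from pandas primitives, so a rough Dask sketch might look like the following (untested; func_p and func_m are the scalar functions defined above, and the partition-wise rank is only safe here because blocksize=None on a gzip file yields a single partition, otherwise you would need to sort or shuffle by chromosome first):
import dask.dataframe as dd

ddf = dd.read_csv('/Users/liamm/Downloads/test_head33333.tsv.gz', sep='\t',
                  header=None, usecols=[0, 1, 3, 5, 6], compression='gzip',
                  blocksize=None,  # gzip files cannot be split into blocks
                  names=['chrom', 'pos', 'depth', 'phred', 'map'])

ddf['phred'] = ddf['phred'].map(func_p, meta=('phred', 'f8'))
ddf['map'] = ddf['map'].map(func_m, meta=('map', 'f8'))
ddf = ddf.query('(depth < 10) | (phred < 7) | (map < 10)')

def add_key(part):
    # Runs per partition as plain pandas
    part = part.copy()
    part['key'] = part['pos'] - part.groupby('chrom')['pos'].rank(method='first')
    return part

ddf = ddf.map_partitions(add_key)
result = (ddf.groupby(['chrom', 'key'])['pos']
             .agg(['min', 'max'])
             .rename(columns={'min': 'start', 'max': 'end'})
             .compute()
             .droplevel('key'))
print(result)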

Optimize row access and transformation in pyspark

I have a large dataset (5 GB) in the form of JSON in an S3 bucket.
I need to transform the schema of the data, and write back the transformed data to S3 using an ETL script.
So I use a crawler to detect the schema, load the data into a pyspark dataframe, and change the schema. Then I iterate over every row in the dataframe and convert it to a dictionary, remove the null columns, convert the dictionary to a string, and write it back to S3. Following is the code:
#df is the pyspark dataframe
columns = df.columns
print(columns)
s3 = boto3.resource('s3')
cnt = 1
for row in df.rdd.toLocalIterator():
    data = row.asDict(True)
    for col_name in columns:
        if data[col_name] is None:
            del data[col_name]
    content = json.dumps(data)
    object = s3.Object('write-test-transaction-transformed', str(cnt)).put(Body=content)
    cnt = cnt + 1
    print(cnt)
I have used toLocalIterator.
Does the above code execute serially? If yes, how can I optimize it? Is there a better approach for executing the above logic?
Assuming each row in the dataset is a JSON string:
import pyspark.sql.functions as F
from pyspark.sql.types import StringType

def drop_null_cols(data):
    import json
    content = json.loads(data)
    for key, value in list(content.items()):
        if value is None:
            del content[key]
    return json.dumps(content)

drop_null_cols_udf = F.udf(drop_null_cols, StringType())

df = spark.createDataFrame(
    ["{\"name\":\"Ranga\", \"age\":25, \"city\":\"Hyderabad\"}",
     "{\"name\":\"John\", \"age\":null, \"city\":\"New York\"}",
     "{\"name\":null, \"age\":31, \"city\":\"London\"}"],
    "string"
).toDF("data")

df.select(
    drop_null_cols_udf("data").alias("data")
).show(10, False)
If the input dataframe has regular columns and the output only needs the non-null columns as JSON:
df = spark.createDataFrame(
    [('Ranga', 25, 'Hyderabad'),
     ('John', None, 'New York'),
     (None, 31, 'London'),
     ],
    ['name', 'age', 'city']
)

df.withColumn(
    "data", F.to_json(F.struct([x for x in df.columns]))
).select(
    drop_null_cols_udf("data").alias("data")
).show(10, False)

#df.write.format("csv").save("s3://path/to/file/")  -- save to s3
which results in:
+-------------------------------------------------+
|data |
+-------------------------------------------------+
|{"name": "Ranga", "age": 25, "city": "Hyderabad"}|
|{"name": "John", "city": "New York"} |
|{"age": 31, "city": "London"} |
+-------------------------------------------------+
I'd follow the approach below (written in Scala, but it can be implemented in Python with minimal changes):
Find the dataset count and name it totalCount
val totalcount = inputDF.count()
Find count(col) for all the dataframe columns and get a map of fields to their counts.
Here the count is computed for all columns of the input dataframe.
Please note that count(anycol) returns the number of rows for which the supplied column is non-null. For example, if a column has 10 rows and 5 of its values are null, then count(column) is 5.
Fetch the first row as Map[colName, count(colName)], referred to as fieldToCount:
val cols = inputDF.columns.map { inputCol =>
  functions.count(col(inputCol)).as(inputCol)
}
// Returns the number of rows for which the supplied column are all non-null.
// count(null) returns 0
val row = dataset.select(cols: _*).head()
val fieldToCount = row.getValuesMap[Long]($(inputCols))
Get the columns to be removed.
Use the map created in step 2 and mark the columns having a count less than totalCount as the columns to be removed.
Select all the columns with count == totalCount from the input dataframe and save the processed output dataframe anywhere, in any format, as required.
Please note that this approach will remove every column having at least one null value.
val fieldToBool = fieldToCount.mapValues(_ < totalcount)
val processedDF = inputDF.select(fieldToBool.filterNot(_._2).keys.map(col).toSeq: _*)
// save this processedDF anywhere in any format as per requirement
I believe this approach will perform better than the approach you currently have.
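Since the answer notes that it can be ported to Python with minimal changes, here is a rough PySpark sketch of the same idea (untested; input_df stands for the original dataframe):
import pyspark.sql.functions as F

total_count = input_df.count()

# F.count(col) counts only the non-null values of that column
counts = input_df.select(
    [F.count(F.col(c)).alias(c) for c in input_df.columns]
).head().asDict()

# Keep only the columns that have no nulls at all
cols_to_keep = [c for c, n in counts.items() if n == total_count]
processed_df = input_df.select(*cols_to_keep)
# processed_df.write.json("s3://path/to/output/")  # save as required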
I solved the above problem.
We can simply query the dataframe for null values.
df = df.filter(df.column.isNotNull()) thereby removing all rows where null is present.
So if there are n columns, we need 2^n queries to filter out all possible combinations. In my case there were 10 columns, so a total of 1024 queries, which is acceptable since SQL queries are parallelized.

KeyError for column that is in Pandas dataframe

I'm having an issue that I can't seem to understand. I've written a function that takes a dataframe as the input and then performs a number of cleaning steps on it. When I run the function I get the error message KeyError: ('amount', 'occurred at index date'). This doesn't make sense to me because amount is a column in my dataframe.
Here is some code with a subset of the data created:
data = pd.DataFrame.from_dict({"date": ["10/31/2019","10/27/2019"], "amount": [-13.3, -6421.25], "vendor": ["publix","verizon"]})
#create cleaning function for dataframe
def cleaning_func(x):
    #convert the amounts to positive numbers
    x['amount'] = x['amount'] * -1
    #convert dates to datetime for subsetting purposes
    x['date'] = pd.to_datetime(x['date'])
    #begin removing certain strings
    x['vendor'] = x['vendor'].str.replace("PURCHASE AUTHORIZED ON ","")
    x['vendor'] = x['vendor'].str.replace("[0-9]","")
    x['vendor'] = x['vendor'].str.replace("PURCHASE WITH CASH BACK $ . AUTHORIZED ON /","")
    #build table of punctuation and remove from vendor strings
    table = str.maketrans(dict.fromkeys(string.punctuation)) # OR {key: None for key in string.punctuation}
    x['vendor'] = x['vendor'].str.translate(table)
    return x

clean_data = data.apply(cleaning_func)
If someone could shed some light on why this error appears I would appreciate it.
The KeyError happens because data.apply(cleaning_func) calls your function once per column, passing each column in as a Series; when the date column is handed in there is no 'amount' key to look up, hence KeyError: ('amount', 'occurred at index date'). Don't use apply here anyway: it's slow and basically loops over your dataframe. Just pass your dataframe to the function and let it return a cleaned-up dataframe; that way the vectorized methods run over whole columns.
def cleaning_func(df):
    #convert the amounts to positive numbers
    df['amount'] = df['amount'] * -1
    #convert dates to datetime for subsetting purposes
    df['date'] = pd.to_datetime(df['date'])
    #begin removing certain strings
    df['vendor'] = df['vendor'].str.replace("PURCHASE AUTHORIZED ON ","")
    df['vendor'] = df['vendor'].str.replace("[0-9]","")
    df['vendor'] = df['vendor'].str.replace("PURCHASE WITH CASH BACK $ . AUTHORIZED ON /","")
    #build table of punctuation and remove from vendor strings
    table = str.maketrans(dict.fromkeys(string.punctuation)) # OR {key: None for key in string.punctuation}
    df['vendor'] = df['vendor'].str.translate(table)
    return df

clean_df = cleaning_func(data)

Python Pandas filtering dataframe on date

I am trying to filter a CSV file on a certain date in a certain column.
I am using pandas (total noob) for that and was pretty successful until I got to dates.
The CSV looks something like this (with more columns and rows of course).
These are the columns:
Circuit
Status
Effective Date
These are the values:
XXXX001
Operational
31-DEC-2007
I tried dataframe query (which I use for everything else) without success.
I tried dataframe loc (which worked for everything else) without success.
How can I get all rows that are older or newer than a given date? If I have other conditions to filter the dataframe, how do I combine them with the date filter?
Here's my "raw" code:
import pandas as pd
# parse_dates = ['Effective Date']
# dtypes = {'Effective Date': 'str'}
df = pd.read_csv("example.csv", dtype=object)
# , parse_dates=parse_dates, infer_datetime_format=True
# tried lot of suggestions found on SO
cols = df.columns
cols = cols.map(lambda x: x.replace(' ', '_'))
df.columns = cols
status1 = 'Suppressed'
status2 = 'Order Aborted'
pool = '2'
region = 'EU'
date1 = '31-DEC-2017'
filt_df = df.query('Status != #status1 and Status != #status2 and Pool == #pool and Region_A == #region')
filt_df.reset_index(drop=True, inplace=True)
filt_df.to_csv('filtered.csv')
# this is working pretty well
supp_df = df.query('Status == #status1 and Effective_Date < #date1')
supp_df.reset_index(drop=True, inplace=True)
supp_df.to_csv('supp.csv')
# this is what is not working at all
I tried many approaches, but I was not able to put it together. This is just one of many approaches I tried, so I know it is perhaps completely wrong, as no date parsing is used.
supp.csv will be saved, but the dates present are all over the place, so there's no match with the "logic" in this code.
Thanks for any help!
Make sure you convert your date column to datetime and then filter/slice on it.
df['Effective Date'] = pd.to_datetime(df['Effective Date'])
df[df['Effective Date'] < '2017-12-31']
#This returns all the values with dates before the 31st of December, 2017.
#You can also use Query
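To combine the date filter with the other conditions from the question, a sketch using query (this assumes the columns have been renamed with underscores as in the question's code, and that the 31-DEC-2007 style dates parse with the %d-%b-%Y format):
import pandas as pd

df = pd.read_csv("example.csv")
df.columns = df.columns.str.replace(' ', '_')
df['Effective_Date'] = pd.to_datetime(df['Effective_Date'], format='%d-%b-%Y')

status1 = 'Suppressed'
date1 = pd.Timestamp('2017-12-31')
supp_df = df.query('Status == @status1 and Effective_Date < @date1')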
