PySpark: Create a subset of a dataframe for all dates

I have a DataFrame that has a lot of columns and I need to create a subset of that DataFrame that has only date values.
For example, my DataFrame could be:
1, 'John Smith', '12/10/1982', '123 Main St', '01/01/2000'
2, 'Jane Smith', '11/21/1999', 'Abc St', '12/12/2020'
And my new DataFrame should only have:
'12/10/1982', '01/01/2000'
'11/21/1999', '12/12/2020'
The dates could be in any format and in any column. I can use dateutil.parser to parse the values and make sure they are dates, but I'm not sure how to easily call parse() on all the columns and keep only those that parse successfully in another DataFrame.

If you know which columns the datetimes are in, it's easy:
pd2 = pd[["row_name_1", "row_name_2"]]
# or
pd2 = pd.iloc[:, [2, 4]]

You can find each column's data type by checking the (name, type) tuples in your_dataframe.dtypes.
from datetime import datetime
# Assumes an active SparkSession named spark
schema = "id int, name string, date timestamp, date2 timestamp"
df = spark.createDataFrame([(1, "John", datetime.now(), datetime.today())], schema)
list_of_columns = []
for (field_name, data_type) in df.dtypes:
    if data_type == "timestamp":
        list_of_columns.append(field_name)
Now you can use this list inside .select():
df_subset_only_timestamps = df.select(list_of_columns)
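EDIT: I realized your date columns might be StringType.
In that case, you could try something like this, which keeps every column that contains at least one value shaped like 12/10/1982:
from pyspark.sql.functions import col, count, when
# Count, per column, the values that match the %/%/% pattern
matches = df.select([count(when(col(c).cast("string").like("%/%/%"), 1)).alias(c) for c in df.columns]).first()
df_subset_only_dates = df.select([c for c in df.columns if matches[c] > 0])
Inspired by this answer. Let me know if it works!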
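If the rows are available as plain Python tuples (for example after collecting a small sample), the "test every column" idea from the question can be sketched without Spark. This is only an illustration: CANDIDATE_FORMATS, looks_like_date and date_column_indexes are made-up names, and the list of formats is an assumption. dateutil.parser.parse could replace the strptime loop if you need looser matching.

```python
from datetime import datetime

# Assumed list of formats to try; extend it for the formats your data uses.
CANDIDATE_FORMATS = ("%m/%d/%Y", "%Y-%m-%d", "%d.%m.%Y")


def looks_like_date(value):
    """Return True if value is a string matching one of CANDIDATE_FORMATS."""
    if not isinstance(value, str):
        return False
    for fmt in CANDIDATE_FORMATS:
        try:
            datetime.strptime(value, fmt)
            return True
        except ValueError:
            continue
    return False


def date_column_indexes(rows):
    """Indexes of the columns in which every value looks like a date."""
    return [
        i for i, column in enumerate(zip(*rows))
        if all(looks_like_date(v) for v in column)
    ]


rows = [
    (1, "John Smith", "12/10/1982", "123 Main St", "01/01/2000"),
    (2, "Jane Smith", "11/21/1999", "Abc St", "12/12/2020"),
]
keep = date_column_indexes(rows)
subset = [tuple(row[i] for i in keep) for row in rows]
print(keep)    # [2, 4]
print(subset)
```

Requiring every value in a column to parse is a deliberately strict choice; relax the all() to any() if some cells may be empty or malformed.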

Related

JSON Fields with Panda DataFrame

I'm using Python (Google Colab) and I have JSON data with some fields like:
[{'ActedBy': ['team'], 'ActedAt': '2022-03-07T22:43:46Z', 'Status': 'Completed', 'LAB': 'No'}]
I need to get the "ActedAt" field in order to extract the date. How can I do this?
Thanks!

You have an array of dictionaries. First, grab a dictionary from the array by index, then proceed to get the ActedAt property. Something like this:
records = [{'ActedBy': ['team'], 'ActedAt': '2022-03-07T22:43:46Z', 'Status': 'Completed', 'LAB': 'No'}]
# index into a variable for explicit readability
index = 0
# get the date you want
date = records[index]['ActedAt']
print(date)
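The value you get back is still a string. If you need an actual date object, a minimal sketch using only the standard library could look like this (the names records, acted_at and acted_date are just for illustration, and the format string assumes every timestamp has exactly this shape):

```python
from datetime import datetime, timezone

records = [{'ActedBy': ['team'], 'ActedAt': '2022-03-07T22:43:46Z', 'Status': 'Completed', 'LAB': 'No'}]

# The trailing "Z" means UTC; match it literally and attach the timezone.
acted_at = datetime.strptime(records[0]['ActedAt'], "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)
acted_date = acted_at.date()
print(acted_date)  # 2022-03-07
```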

Pandas Dataframe from list nested in json

I have a request that gets me some data that looks like this:
[{'__rowType': 'META',
  '__type': 'units',
  'data': [{'name': 'units.unit', 'type': 'STRING'},
           {'name': 'units.classification', 'type': 'STRING'}]},
 {'__rowType': 'DATA', '__type': 'units', 'data': ['A', 'Energie']},
 {'__rowType': 'DATA', '__type': 'units', 'data': ['bar', ' ']},
 {'__rowType': 'DATA', '__type': 'units', 'data': ['CCM', 'Volumen']},
 {'__rowType': 'DATA', '__type': 'units', 'data': ['CDM', 'Volumen']}]
and would like to construct a (Pandas) DataFrame that looks like this:
Things like pd.DataFrame(pd.json_normalize(test)['data']) are close but still put the whole list into one column instead of making separate columns. record_path sounded right, but I can't get it to work correctly either.
Any help?

It's difficult to know how the example generalizes, but for this particular case you could use:
pd.DataFrame([d['data'] for d in test
              if d.get('__rowType', None)=='DATA' and 'data' in d],
             columns=['unit', 'classification']
            )
NB: this assumes test is the input list.
output:
  unit classification
0    A        Energie
1  bar
2  CCM        Volumen
3  CDM        Volumen

Instead of just giving you the code, I'll first explain in detail how this works, then show the exact steps to follow and the final code, so that you can adapt it to similar situations.
To create a pandas dataframe with two columns, you can build a dictionary and pass it to the DataFrame class:
my_data = {'col1': [1, 2], 'col2': [3, 4]}
df = pd.DataFrame(data=my_data)
This will result in this dataframe:
   col1  col2
0     1     3
1     2     4
So if you want the dataframe you specified in your question, the my_data dictionary should look like this:
import numpy as np
import pandas as pd
my_data = {
    'unit': ['A', 'bar', 'CCM', 'CDM'],
    'classification': ['Energie', '', 'Volumen', 'Volumen'],
}
df = pd.DataFrame(data=my_data)
df.index = np.arange(1, len(df)+1)
df
(Note the df.index = ... line: it is there because the index of the desired dataframe in your question starts at 1.)
So all you have to do is extract these values from the data you provided and convert them into exactly the my_data dictionary shown above.
To do so you can do this:
# This will get the data values like 'bar', 'CCM' and etc from your initial data
values = [x['data'] for x in d if x['__rowType']=='DATA']
# This gets the columns names from meta data
meta = list(filter(lambda x: x['__rowType']=='META', d))[0]
columns = [x['name'].split('.')[-1] for x in meta['data']]
# This line creates the exact dictionary we need to send to DataFrame class.
my_data = {column:[v[i] for v in values] for i, column in enumerate(columns)}
So the whole code would be this:
import numpy as np
import pandas as pd
d = YOUR_DATA
# This gets the data values like 'bar', 'CCM', etc.
values = [x['data'] for x in d if x['__rowType']=='DATA']
# This gets the column names from the metadata
meta = list(filter(lambda x: x['__rowType']=='META', d))[0]
columns = [x['name'].split('.')[-1] for x in meta['data']]
# This line creates the exact dictionary we need to pass to the DataFrame class.
my_data = {column:[v[i] for v in values] for i, column in enumerate(columns)}
df = pd.DataFrame(data=my_data)
df.index = np.arange(1, len(df)+1)
df  # or print(df)
Note: Of course, you could do all of this in one complex line of code, but to avoid confusion I split it across a few lines.
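A variation on the same steps, as a rough sketch: instead of a dictionary of columns you can build one dictionary per row by zipping the column names from the META entry with each DATA entry. The variable names (test, meta, columns, records) are only illustrative, and this assumes there is exactly one META entry.

```python
test = [
    {'__rowType': 'META', '__type': 'units',
     'data': [{'name': 'units.unit', 'type': 'STRING'},
              {'name': 'units.classification', 'type': 'STRING'}]},
    {'__rowType': 'DATA', '__type': 'units', 'data': ['A', 'Energie']},
    {'__rowType': 'DATA', '__type': 'units', 'data': ['bar', ' ']},
    {'__rowType': 'DATA', '__type': 'units', 'data': ['CCM', 'Volumen']},
    {'__rowType': 'DATA', '__type': 'units', 'data': ['CDM', 'Volumen']},
]

# Take the first (assumed only) META entry and strip the "units." prefix.
meta = next(row for row in test if row['__rowType'] == 'META')
columns = [field['name'].split('.')[-1] for field in meta['data']]

# One dict per DATA row, keyed by column name.
records = [dict(zip(columns, row['data'])) for row in test if row['__rowType'] == 'DATA']
print(columns)     # ['unit', 'classification']
print(records[0])  # {'unit': 'A', 'classification': 'Energie'}
```

A list of dictionaries like records can be passed straight to pd.DataFrame(records).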

Pandas MultiIndex with an unrecognised time format - how to convert time and apply calculation

EDIT: Thanks to Scott Boston for advising me on how to post correctly.
I have a dataframe containing clock in/out dates and times from work for all employees. A sample input df is below, but the real data set has a year of data for many employees.
Question:
What I would like to do is to calculate the time spent in work for each employee over the year.
df = pd.DataFrame({'name': ['Joe Bloggs', 'Joe Bloggs', 'Joe Bloggs',
                            'Joe Bloggs', 'Jane Doe', 'Jane Doe', 'Jane Doe',
                            'Jane Doe'],
                   'Date': ['2020-06-19', '2020-06-19', '2020-06-18', '2020-06-18', '2020-06-19',
                            '2020-06-19', '2020-06-18', '2020-06-18'],
                   'Time': ["17:30:06", "09:00:00", "17:44:00", "08:34:02", "16:30:06",
                            "10:00:02", "15:45:33", "09:30:33"],
                   'type': ["Logout", "Login", "Logout",
                            "Login", "Logout", "Login",
                            "Logout", "Login"]})

You can do it this way:
# Create a datetime column combining date and time, plus a year column
df['datetime'] = pd.to_datetime(df['Date'] + ' ' + df['Time'], format='%Y-%m-%d %H:%M:%S')
df['year'] = df['datetime'].dt.year
# Sort the dataframe by datetime
df = df.sort_values('datetime')
# Number the "sessions" worked, one per Login record
session = (df['type'] == 'Login').groupby(df['name']).cumsum().rename('Session_No')
# Reshape the dataframe to get the login and logout for a session on one row,
# then use diff to calculate the time worked during that session
df_time = df.set_index(['name', 'year', session, 'type'])['datetime']\
    .unstack().diff(axis=1).dropna(axis=1, how='all')\
    .rename(columns={'Logout':'TimeLoggedIn'})
# Sum by name and year; groupby(level=...) replaces sum(level=...), which was removed in pandas 2.0
df_time.groupby(level=[0, 1]).sum().reset_index()
Output:
         name  year TimeLoggedIn
0    Jane Doe  2020     12:45:04
1  Joe Bloggs  2020     17:40:04
Note: @warped's solution works and works well; however, if you had an employee who worked overnight, I think that code breaks down. This answer should capture cases where an employee works past midnight.

df['Time'] = pd.to_timedelta(df['Time'])
df['Date'] = pd.to_datetime(df['Date'])
df['time_complete'] = df['Time'] + df['Date']
df.groupby(['name', 'Date']).apply(lambda x: (x.sort_values('type', ascending=True)['time_complete'].diff().dropna()))
How it works:
Convert the dates to datetime, to allow grouping.
Convert the times to timedelta, to allow subtraction.
Create a complete timestamp, to incorporate potential night shifts (as spotted by @ScottBoston).
Then, group by date and employee to isolate those.
So, each group now corresponds to one employee on a specific date.
Within each group, the relevant columns are 'type', 'Time' and 'time_complete'.
Sorting the rows by 'type' in ascending order puts 'Login' before 'Logout'.
Then, we take the difference (row n minus row n-1) of column 'time_complete' within each sorted group, which gives the time spent between login and logout.
Finally, we remove the null value that diff produces for the first row of each group, which has no previous row.
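To sanity-check the totals independently of pandas, here is a rough standard-library sketch of the same pairing idea: sort each employee's events by full timestamp and pair every Login with the next Logout, which also covers a shift that ends after midnight. The function name time_logged_in and the events tuple layout are made up for illustration, and it sums over all dates rather than per year.

```python
from collections import defaultdict
from datetime import datetime, timedelta

events = [
    ("Joe Bloggs", "2020-06-19", "17:30:06", "Logout"),
    ("Joe Bloggs", "2020-06-19", "09:00:00", "Login"),
    ("Joe Bloggs", "2020-06-18", "17:44:00", "Logout"),
    ("Joe Bloggs", "2020-06-18", "08:34:02", "Login"),
    ("Jane Doe", "2020-06-19", "16:30:06", "Logout"),
    ("Jane Doe", "2020-06-19", "10:00:02", "Login"),
    ("Jane Doe", "2020-06-18", "15:45:33", "Logout"),
    ("Jane Doe", "2020-06-18", "09:30:33", "Login"),
]


def time_logged_in(events):
    """Sum Logout - Login per employee, pairing each Login with the next Logout."""
    stamped = sorted(
        (name, datetime.strptime(f"{date} {time}", "%Y-%m-%d %H:%M:%S"), kind)
        for name, date, time, kind in events
    )
    totals = defaultdict(timedelta)
    open_login = {}  # employee name -> timestamp of the unmatched Login
    for name, stamp, kind in stamped:
        if kind == "Login":
            open_login[name] = stamp
        elif name in open_login:
            totals[name] += stamp - open_login.pop(name)
    return dict(totals)


totals = time_logged_in(events)
for name, total in sorted(totals.items()):
    print(name, total)
# Jane Doe 12:45:04
# Joe Bloggs 17:40:04
```

A Logout without a preceding Login is silently ignored here; whether that is the right policy depends on your data.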

Replace values from pandas dataset with dictionary

I am extracting a column from an Excel document with pandas. After that, for each row of the selected column, I want to replace every occurrence of the keys contained in a list of dictionaries.
import pandas as pd
file_loc = "excelFile.xlsx"
df = pd.read_excel(file_loc, usecols = "C")
In this case, I access the column as df['Q10']; it has more than 10k rows.
Traditionally, if I want to replace a value in df I use:
df['Q10'].str.replace('val1', 'val2')
Now, I have a list of dictionaries of words like:
mydic = [
    {
        'key': "wasn't",
        'value': 'was not'
    },
    {
        'key': "I'm",
        'value': 'I am'
    },
    # ... + tons of lines of key-value pairs
]
Currently, I have created a function that iterates over "mydic" and replaces all occurrences one by one.
def replaceContractions(df, mydic):
    for cont in contractions:
        df.str.replace(cont['key'], cont['value'])
Next I call this function, passing mydic and my dataframe:
replaceContractions(df['Q10'], contractions)
First problem: this is very expensive because mydic has a lot of items and the data set is scanned once for each of them.
Second: it doesn't seem to work :(
Any ideas?

Convert your "dictionary" to a more friendly format:
m = {d['key'] : d['value'] for d in mydic}
m
{"I'm": 'I am', "wasn't": 'was not'}
Next, call replace with the regex switch and pass m to it.
df['Q10'] = df['Q10'].replace(m, regex=True)
replace accepts a dictionary of key-replacement pairs, and it should be much faster than iterating over the key-replacement pairs one at a time.
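One thing to keep in mind with regex=True is that the keys are treated as regular expressions. If your keys may contain characters such as "." or "?", a possible alternative is to escape them and build a single pattern yourself. The sketch below shows that idea on plain strings with the standard re module; the names pattern and expand are made up for illustration.

```python
import re

mydic = [
    {"key": "wasn't", "value": "was not"},
    {"key": "I'm", "value": "I am"},
]
m = {d["key"]: d["value"] for d in mydic}

# Longest keys first so a longer contraction wins over a shorter prefix.
pattern = re.compile("|".join(re.escape(k) for k in sorted(m, key=len, reverse=True)))


def expand(text):
    """Replace every key of m found in text with its value, in one pass."""
    return pattern.sub(lambda match: m[match.group(0)], text)


print(expand("I'm sure it wasn't me"))  # I am sure it was not me
```

On a column of strings without missing values, something like df['Q10'].map(expand) should apply it row by row.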

Change order of list of lists according to another list

I have a bunch of CSV files where the first line holds the column names, and now I want to change the column order according to another list.
Example:
[
    ['date','index','name','position'],
    ['2003-02-04','23445','Steiner, James','98886'],
    ['2003-02-04','23446','Holm, Derek','2233'],
    ...
]
The above order differs slightly between the files, but the same column names are always available.
So I want the columns to be rearranged as:
['index','date','name','position']
I can solve it by comparing the first row, making an index for each column, then re-mapping each row into a new list of lists using a for loop.
And while it works, it feels so ugly even my blind old aunt would yell at me if she saw it.
Someone on IRC told me to look at map() and operator, but I'm just not experienced enough to puzzle those together. :/
Thanks.

Plain Python
You could use zip to transpose your data:
data = [
    ['date','index','name','position'],
    ['2003-02-04','23445','Steiner, James','98886'],
    ['2003-02-04','23446','Holm, Derek','2233']
]
columns = list(zip(*data))
print(columns)
# [('date', '2003-02-04', '2003-02-04'), ('index', '23445', '23446'), ('name', 'Steiner, James', 'Holm, Derek'), ('position', '98886', '2233')]
It becomes much easier to modify the columns order now.
To calculate the needed permutation, map each new position to the column's old position:
old = data[0]
new = ['index','date','name','position']
mapping = {i:old.index(v) for i,v in enumerate(new)}
# {0: 1, 1: 0, 2: 2, 3: 3}
You can apply the permutation to the columns:
columns = [columns[mapping[i]] for i in range(len(columns))]
# [('index', '23445', '23446'), ('date', '2003-02-04', '2003-02-04'), ('name', 'Steiner, James', 'Holm, Derek'), ('position', '98886', '2233')]
and transpose them back:
list(zip(*columns))
# [('index', 'date', 'name', 'position'), ('23445', '2003-02-04', 'Steiner, James', '98886'), ('23446', '2003-02-04', 'Holm, Derek', '2233')]
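The IRC hint about operator mentioned in the question probably refers to operator.itemgetter, which avoids the double transpose. A minimal sketch, where pick and reordered are illustrative names:

```python
from operator import itemgetter

data = [
    ['date', 'index', 'name', 'position'],
    ['2003-02-04', '23445', 'Steiner, James', '98886'],
    ['2003-02-04', '23446', 'Holm, Derek', '2233'],
]
new_header = ['index', 'date', 'name', 'position']

# For each new column, look up where it sits in the old header.
# Note: with a single index, itemgetter returns the item itself, not a tuple.
pick = itemgetter(*(data[0].index(name) for name in new_header))
reordered = [list(pick(row)) for row in data]
print(reordered[0])  # ['index', 'date', 'name', 'position']
print(reordered[1])  # ['23445', '2003-02-04', 'Steiner, James', '98886']
```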
With Pandas
For this kind of task, you should use pandas.
It can parse CSVs, reorder columns, sort them and keep an index.
If you have already loaded the data as a list of lists, you can build a DataFrame from it, using the first row as the header and the index column as the index:
import pandas as pd
df = pd.DataFrame(data[1:], columns=data[0]).set_index('index')
df then becomes:
             date            name position
index
23445  2003-02-04  Steiner, James    98886
23446  2003-02-04     Holm, Derek     2233
You can avoid those steps by reading the CSV directly with pandas.read_csv. Note that usecols only selects columns and ignores the order you list them in, so select the columns again after reading to get the order you want, for example pd.read_csv(path)[['index','date','name','position']], where path is your file.

Simple and stupid:
LIST = [
    ['date', 'index', 'name', 'position'],
    ['2003-02-04', '23445', 'Steiner, James', '98886'],
    ['2003-02-04', '23446', 'Holm, Derek', '2233'],
]
NEW_HEADER = ['index', 'date', 'name', 'position']
def swap(lists, new_header):
    mapping = {}
    for lst in lists:
        if not mapping:
            mapping = {
                old_pos: new_pos
                for new_pos, new_field in enumerate(new_header)
                for old_pos, old_field in enumerate(lst)
                if new_field == old_field}
        yield [item for _, item in sorted(
            [(mapping[index], item) for index, item in enumerate(lst)])]
if __name__ == '__main__':
    print(LIST)
    print(list(swap(LIST, NEW_HEADER)))

To rearrange your data, you can use a dictionary:
import csv
s = [
    ['date','index','name','position'],
    ['2003-02-04','23445','Steiner, James','98886'],
    ['2003-02-04','23446','Holm, Derek','2233'],
]
new_order = ['index','date','name','position']
new_data = [{a:b for a, b in zip(s[0], i)} for i in s[1:]]
final_data = [[b[c] for c in new_order] for b in new_data]
# Open in write mode; newline='' is what the csv module expects
with open('filename.csv', 'w', newline='') as f:
    write = csv.writer(f)
    write.writerow(new_order)
    write.writerows(final_data)
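Since the data comes from CSV files in the first place, the same dictionary idea can be applied while reading and writing, using csv.DictReader and csv.DictWriter. The sketch below uses io.StringIO in place of real files so that it runs on its own; source, target and new_header are illustrative names.

```python
import csv
import io

new_header = ['index', 'date', 'name', 'position']

# io.StringIO stands in for real files opened with open(path, newline='').
source = io.StringIO(
    'date,index,name,position\n'
    '2003-02-04,23445,"Steiner, James",98886\n'
    '2003-02-04,23446,"Holm, Derek",2233\n'
)
target = io.StringIO()

reader = csv.DictReader(source)  # keys each row by the file's own header
writer = csv.DictWriter(target, fieldnames=new_header, lineterminator='\n')
writer.writeheader()
writer.writerows(reader)         # DictWriter emits fields in fieldnames order

print(target.getvalue())
```

Because each row is keyed by the header of its own file, this works regardless of the column order in the input.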
