Slicing MultiIndex data with pandas - python

I have imported a CSV as a multi-indexed DataFrame. Here's a mock-up of the data:
df = pd.read_csv("coursedata2.csv", index_col=[0,2])
print(df)
                                         COURSE
ID    Course List
12345 Interior Environments           DESN10000
      Rendering & Present Skills      DESN20065
      Lighting                        DESN20025
22345 Drawing Techniques              DESN10016
      Colour Theory                   DESN14049
      Finishes & Sustainable Issues   DESN12758
      Lighting                        DESN20025
32345 Window Treatments&Soft Furnish  DESN27370
42345 Introduction to CADD            INFO16859
      Principles of Drafting          DESN10065
      Drawing Techniques              DESN10016
      The Fundamentals of Design      DESN15436
      Colour Theory                   DESN14049
      Interior Environments           DESN10000
      Drafting                        DESN10123
      Textiles and Applications       DESN10199
      Finishes & Sustainable Issues   DESN12758

[17 rows x 1 columns]
I can easily slice it by label using .xs -- e.g.:
selected = df.xs(12345, level='ID')
print(selected)
                               COURSE
Course List
Interior Environments       DESN10000
Rendering & Present Skills  DESN20065
Lighting                    DESN20025

[3 rows x 1 columns]
But what I want to do is step through the dataframe and perform an operation on each block of courses, by ID. The ID values in the real data are fairly random integers, sorted in ascending order.
df.index shows:
df.index
MultiIndex(levels=[[12345, 22345, 32345, 42345], [u'Colour Theory', u'Colour Theory ', u'Drafting', u'Drawing Techniques', u'Finishes & Sustainable Issues', u'Interior Environments', u'Introduction to CADD', u'Lighting', u'Principles of Drafting', u'Rendering & Present Skills', u'Textiles and Applications', u'The Fundamentals of Design', u'Window Treatments&Soft Furnish']],
labels=[[0, 0, 0, 1, 1, 1, 1, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3], [5, 9, 7, 3, 1, 4, 7, 12, 6, 8, 3, 11, 0, 5, 2, 10, 4]],
names=[u'ID', u'Course List'])
It seems to me that I should be able to use the first index's integer labels to step through the DataFrame, i.e. get all the courses for label 0, then 1, then 2, then 3, ... but it looks like .xs will not slice by those integer positions.
Am I missing something?

So there may be more efficient ways to do this, depending on what you're trying to do to the data. However, there are two approaches which immediately come to mind:
for id_label in df.index.levels[0]:
    some_func(df.xs(id_label, level='ID'))
and
for id_label in df.index.levels[0]:
    df.xs(id_label, level='ID').apply(some_func, axis=1)
depending on whether you want to operate on the group as a whole or on each row within it.
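Equivalently, groupby on the ID level walks the same blocks without touching .levels directly. A minimal sketch (the DataFrame here is a mocked-up subset of the data above, and the print stands in for whatever per-block operation you need):

```python
import pandas as pd

# a small stand-in for the course DataFrame
df = pd.DataFrame(
    {'COURSE': ['DESN10000', 'DESN20065', 'DESN20025', 'DESN10016']},
    index=pd.MultiIndex.from_tuples(
        [(12345, 'Interior Environments'),
         (12345, 'Rendering & Present Skills'),
         (12345, 'Lighting'),
         (22345, 'Drawing Techniques')],
        names=['ID', 'Course List']))

# each iteration yields one ID and that ID's block of courses
for id_label, block in df.groupby(level='ID'):
    print(id_label, len(block))
```

This avoids repeated .xs lookups, since groupby hands you each block directly.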

Related

Copy and convert text in postgresql column

Let's say I have some JSON stored in postgresql like so:
{"the": [0, 4], "time": [1, 5], "is": [2, 6], "here": [3], "now": [7]}
This is an inverted index showing the position of each word, which spells out
the time is here the time is now
I want to put the text from the second example in a separate column. I can convert the inverted text with python like so:
def convert_index(inverted_index):
    unraveled = {}
    for key, values in inverted_index.items():
        for value in values:
            unraveled[value] = key
    sorted_unraveled = dict(sorted(unraveled.items()))
    result = " ".join(sorted_unraveled.values())
    result = result.replace("\n", "")
    return result
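(As an aside, not part of the original question: the unraveling above boils down to sorting (position, word) pairs, so a compact equivalent is:)

```python
def unravel(inverted_index):
    # pair each position with its word, sort by position, then join
    pairs = [(pos, word)
             for word, positions in inverted_index.items()
             for pos in positions]
    return " ".join(word for _, word in sorted(pairs))

inverted = {"the": [0, 4], "time": [1, 5], "is": [2, 6], "here": [3], "now": [7]}
print(unravel(inverted))  # the time is here the time is now
```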
But I would love to do this within postgresql so I am not reading text from one column, running a script somewhere else, then adding text in a separate column. Anybody know of a way to go about that? Can I use some kind of script?
You need to get keys with jsonb_each() and unpack arrays with jsonb_array_elements() then aggregate the keys with proper order:
with my_table(json_col) as (
    values
        ('{"the": [0, 4], "time": [1, 5], "is": [2, 6], "here": [3], "now": [7]}'::jsonb)
)
select string_agg(key, ' ' order by ord::int)
from my_table
cross join jsonb_each(json_col)
cross join jsonb_array_elements(value) as e(ord)
Test it in Db<>fiddle.

Compare two Pandas dataframe for addition of any new rows with respect to the column

I am writing a parser that watches a pseudo-table web application and pushes a notification whenever rows are added.
Mechanics of the pseudo-table: the table on the website changes over time and adds new rows. The page is highly dynamic and sometimes changes existing rows. The pseudo-table automatically assigns IDs according to its sorting mechanic. To be precise, the sort is alphabetical, so a person named Adam gets ID 1, Bob = 2, Coul = 3. But if a person named Caul is added, he becomes ID 3 and Coul becomes 4. This ruins all the methods I have tried so far.
I am now trying to compare two Pandas dataframes to detect row additions and return the newly added rows. I do not want to return existing rows that were changed. I tried concat plus duplicate removal, but any minor change in an existing row makes it show up as new.
TL;DR EXAMPLE
Input
d1 = {'#': [1, 2, 3], 'Name': ['James Bourne', 'Steve Johns', 'Steve Jobs']}
d2 = {'#': [1, 2, 3, 4], 'Name': ['James Bourne', 'Steve Jobs', 'Great Guy', 'Steve Johns']}
df_1 = pd.DataFrame(data=d1)
df_2 = pd.DataFrame(data=d2)
# ... code
Output should be
3 Great Guy
You could try a simpler solution:
df_2[~df_2.Name.isin(df_1.Name)].dropna()
Output:
   #       Name
2  3  Great Guy
Merge the dfs with how='outer', then compare the merged df to the list of original Names:
>>> merged = pd.merge(df_1, df_2, on='Name', how='outer')
>>> [x for x in enumerate(merged.Name) if x[1] not in list(df_1.Name)]
Results in: [(3, 'Great Guy')]
I found the subset parameter of drop_duplicates.
d1 = {'#': [1, 2, 3], 'Name': ['James Bourne', 'Steve Johns', 'Steve Jobs']}
d2 = {'#': [1, 2, 3, 4], 'Name': ['James Bourne', 'Steve Jobs', 'Great Guy', 'Steve Johns']}
df_1 = pd.DataFrame(data=d1)
df_2 = pd.DataFrame(data=d2)
df_1 = df_1.set_index('#')
df_2 = df_2.set_index('#')
df = pd.concat([df_1,df_2]).drop_duplicates(subset=['Name'], keep=False)
df
results in
Name
#
3 Great Guy
This solves my question.
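Another common idiom (added here as a sketch, not part of the original answers) is a left merge with indicator=True, which tags the rows of df_2 that have no Name match in df_1:

```python
import pandas as pd

df_1 = pd.DataFrame({'#': [1, 2, 3],
                     'Name': ['James Bourne', 'Steve Johns', 'Steve Jobs']})
df_2 = pd.DataFrame({'#': [1, 2, 3, 4],
                     'Name': ['James Bourne', 'Steve Jobs', 'Great Guy', 'Steve Johns']})

# rows tagged 'left_only' exist in df_2 but not in df_1
merged = df_2.merge(df_1[['Name']], on='Name', how='left', indicator=True)
new_rows = merged.loc[merged['_merge'] == 'left_only', ['#', 'Name']]
print(new_rows)  # one row: # = 3, Name = 'Great Guy'
```

Unlike concat + drop_duplicates, this compares only the Name column, so changes in other columns don't produce false positives.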

Python 3.x: Perform analysis on dictionary of dataframes in loops

I have a dataframe (df) whose column names are ["Home", "Season", "Date", "Consumption", "Temp"]. What I'm trying to do is perform calculations on this dataframe by "Home", "Season", "Temp" and "Consumption".
In[56]: df['Home'].unique().tolist()
Out[56]: [1, 2, 3, 5, 6, 7, 8, 9, 10, 11, 12, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23]
In[57]: df['Season'].unique().tolist()
Out[57]: ['Spring', 'Summer', 'Autumn', 'Winter']
Here is what is done so far:
series = {}
for i in df['Home'].unique().tolist():
    for j in df["Season"].unique().tolist():
        series[i, j] = df[(df["Home"] == i) & (df["Consumption"] >= 0) & (df["Season"] == j)]
for key, value in series.items():
    value["Corr"] = value["Temp"].corr(value["Consumption"])
The output of the first loop is the dictionary of dataframes named series.
What I expected from the last loop was a dictionary of dataframes with a new "Corr" column holding the correlation between "Temp" and "Consumption" for each frame, but instead it gives a single dataframe for the last home in the iteration, i.e. 23.
The goal is simply to add a sixth column named "Corr" to every dataframe in the dictionary, holding the correlation between "Temp" and "Consumption". Can you help me with the above? I'm somehow missing the use of the keys in the last loop. Thanks in advance!
All of those loops are entirely unnecessary! Simply call:
df.groupby(['Home', 'Season'])[['Consumption', 'Temp']].corr()
(thanks #jezrael for the correction)
One of the answers on How to find the correlation between a group of values in a pandas dataframe column helped, avoiding all the unnecessary loops. Thanks #jezrael and #JoshFriedlander for suggesting the groupby method. Upvote (y).
Posting solution here:
df = df[df["Consumption"] >= 0]
corrs = (df[["Home", "Season", "Temp"]]
         .groupby(["Home", "Season"])
         .corrwith(df["Consumption"])
         .rename(columns={"Temp": "Corr"})
         .reset_index())
df = pd.merge(df, corrs, how="left", on=["Home", "Season"])
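To make the groupby idea concrete, here is a self-contained sketch with toy numbers (not the real data; home 1's consumption falls with temperature, home 2's rises, so the correlations come out -1 and +1):

```python
import pandas as pd

df = pd.DataFrame({
    'Home': [1, 1, 1, 2, 2, 2],
    'Season': ['Spring'] * 3 + ['Spring'] * 3,
    'Temp': [10, 20, 30, 5, 10, 15],
    'Consumption': [30, 20, 10, 2, 4, 6],
})

# one correlation per (Home, Season) block
corr_map = (df.groupby(['Home', 'Season'])
              .apply(lambda g: g['Temp'].corr(g['Consumption'])))

# broadcast the per-group values back onto the rows as a "Corr" column
df = df.merge(corr_map.rename('Corr').reset_index(), on=['Home', 'Season'])
```

The merge at the end mirrors the accepted approach above: compute once per group, then join back onto the row-level frame.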

How to use a pandas Styler function to color an html table by column value

I have a dataframe that I'm rendering to an html table that I'd like to style using a gradient coloring based on a target value for each column.
Ideally, the target value would not have a background applied, and the other values would have shades of red or green depending on whether they were higher or lower than the target value.
df = pd.DataFrame({'Col 1': [7, 8, 12, 4, 'Target Val 1'],
                   'Col 2': [1, 4, 5, 8, 'Target Val 2'],
                   'Col 3': [7, 5, 3, 24, 'Target Val 3']})
I've tried to use Styler.background_gradient but I can only get that to use the largest value in the column as the high for the gradient and the target value ends up styled along with the rest of the column.
Edit: current code:
cm = sns.light_palette('green', as_cmap=True)
[t.style.set_table_attributes('border="1" class="dataframe table table-hover table-bordered"')
  .background_gradient(cm)
  .set_properties(**{'font-size': '12px',
                     'font-family': 'Calibri',
                     'text-align': 'center'})
  .format("{:.0f}%", subset=percent_cols)
  .render()
 for t in style_teams]

large nested lists versus dictionaries

Please could I solicit some general advice regarding Python lists. I know I shouldn't ask 'open' questions on here but I am worried about setting off on completely the wrong path.
My problem is that I have .csv files that are approximately 600,000 lines long each. Each row of the .csv has 6 fields, of which the first field is a date-time stamp in the format DD/MM/YYYY HH:MM:SS. The next two fields are blank and the last three fields contain float and integer values, so for example:
23/05/2017 16:42:17, , , 1.25545, 1.74733, 12
23/05/2017 16:42:20, , , 1.93741, 1.52387, 14
23/05/2017 16:42:23, , , 1.54875, 1.46258, 11
etc
No two values in column 1 (date-time stamp) will ever be the same.
I need to write a program that will do a few basic operations with the data, such as:
read all of the data into a dictionary, list, set (?) etc as appropriate.
search through the date time stamp column for a particular value.
read through the list and do basic calculations on the floats in columns 4 and 5.
write a new list based on the searches/calculations.
My question is - how should I 'handle' the data and am I likely to run into problems due to the length of the dataset?
For example, should I import all of the data into a list, where each element of the list is a sublist of one row's data? E.g.:
[['23/05/2017 16:42:17', '', '', 1.25545, 1.74733, 12], ['23/05/2017 16:42:20', '', '', 1.93741, 1.52387, 14], ...]
Or would it be better to make each date-time stamp the 'key' in a dictionary and make the dictionary 'value' a list with all the other values, e.g.:
{'23/05/2017 16:42:17': ['', '', 1.25545, 1.74733, 12], ...}
etc
If I use the list approach, is there a way to get Python to search only the first column for a particular timestamp, rather than searching through 600,000 rows times 6 columns, when we know that only the first column contains timestamps?
I apologize if my query is a little vague, but would appreciate any guidance that anyone can offer.
600,000 lines aren't that many; your script should run fine with either a list or a dict.
As a test, let's use:
data = [["2017-05-02 17:28:24", 0.85260, 1.16218, 7],
["2017-05-04 05:40:07", 0.72118, 0.47710, 15],
["2017-05-07 19:27:53", 1.79476, 0.47496, 14],
["2017-05-09 01:57:10", 0.44123, 0.13711, 16],
["2017-05-11 07:22:57", 0.17481, 0.69468, 0],
["2017-05-12 10:11:01", 0.27553, 0.47834, 4],
["2017-05-15 05:20:36", 0.01719, 0.51249, 7],
["2017-05-17 14:01:13", 0.35977, 0.50052, 7],
["2017-05-17 22:05:33", 1.68628, 1.90881, 13],
["2017-05-18 14:44:14", 0.32217, 0.96715, 14],
["2017-05-18 20:24:23", 0.90819, 0.36773, 5],
["2017-05-21 12:15:20", 0.49456, 1.12508, 5],
["2017-05-22 07:46:18", 0.59015, 1.04352, 6],
["2017-05-26 01:49:38", 0.44455, 0.26669, 13],
["2017-05-26 18:55:24", 1.33678, 1.24181, 7]]
dict
If you're looking for exact timestamps, a lookup will be much faster with a dict than with a list. You have to know exactly what you're looking for though: "23/05/2017 16:42:17" has a completely different hash than "23/05/2017 16:42:18".
data_as_dict = {l[0]: l[1:] for l in data}
print(data_as_dict)
# {'2017-05-21 12:15:20': [0.49456, 1.12508, 5], '2017-05-18 14:44:14': [0.32217, 0.96715, 14], '2017-05-04 05:40:07': [0.72118, 0.4771, 15], '2017-05-26 01:49:38': [0.44455, 0.26669, 13], '2017-05-17 14:01:13': [0.35977, 0.50052, 7], '2017-05-15 05:20:36': [0.01719, 0.51249, 7], '2017-05-26 18:55:24': [1.33678, 1.24181, 7], '2017-05-07 19:27:53': [1.79476, 0.47496, 14], '2017-05-17 22:05:33': [1.68628, 1.90881, 13], '2017-05-02 17:28:24': [0.8526, 1.16218, 7], '2017-05-22 07:46:18': [0.59015, 1.04352, 6], '2017-05-11 07:22:57': [0.17481, 0.69468, 0], '2017-05-18 20:24:23': [0.90819, 0.36773, 5], '2017-05-12 10:11:01': [0.27553, 0.47834, 4], '2017-05-09 01:57:10': [0.44123, 0.13711, 16]}
print(data_as_dict.get('2017-05-17 14:01:13'))
# [0.35977, 0.50052, 7]
print(data_as_dict.get('2017-05-17 14:01:10'))
# None
Note that your DD/MM/YYYY HH:MM:SS format isn't very convenient: sorting the strings lexicographically won't sort them by datetime. You'd need to parse them with datetime.strptime() first (the sample data above already uses %Y-%m-%d; your own files would need '%d/%m/%Y %H:%M:%S'):
from datetime import datetime
data_as_dict = {datetime.strptime(l[0], '%Y-%m-%d %H:%M:%S'): l[1:] for l in data}
print(data_as_dict.get(datetime(2017,5,17,14,1,13)))
# [0.35977, 0.50052, 7]
print(data_as_dict.get(datetime(2017,5,17,14,1,10)))
# None
list with binary search
If you're looking for timestamps ranges, a dict won't help you much. A binary search (e.g. with bisect) on a list of timestamps should be very fast.
import bisect
timestamps = [datetime.strptime(l[0], '%Y-%m-%d %H:%M:%S') for l in data]
i = bisect.bisect(timestamps, datetime(2017,5,17,14,1,10))
print(data[i-1])
# ['2017-05-15 05:20:36', 0.01719, 0.51249, 7]
print(data[i])
# ['2017-05-17 14:01:13', 0.35977, 0.50052, 7]
Database
Before reinventing the wheel, you might want to dump all your CSVs into a small database (sqlite, Postgresql, ...) and use the corresponding queries.
Pandas
If you don't want the added complexity of a database but are ready to invest some time learning a new syntax, you should use pandas.DataFrame. It does exactly what you want, and then some.
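A minimal sketch of that pandas route, assuming the CSV layout shown in the question (the column names here are made up):

```python
import io
import pandas as pd

# stand-in for one of the CSV files
csv_text = """23/05/2017 16:42:17, , , 1.25545, 1.74733, 12
23/05/2017 16:42:20, , , 1.93741, 1.52387, 14
23/05/2017 16:42:23, , , 1.54875, 1.46258, 11"""

df = pd.read_csv(io.StringIO(csv_text), header=None,
                 names=['ts', 'blank1', 'blank2', 'f1', 'f2', 'count'],
                 parse_dates=['ts'], dayfirst=True, skipinitialspace=True)
df = df.set_index('ts')

# exact-timestamp lookup, and a basic calculation on the float columns
row = df.loc['2017-05-23 16:42:20']
mean_f1 = df['f1'].mean()
```

With the timestamps as a DatetimeIndex you also get range queries for free, e.g. df.loc['2017-05-23'] selects a whole day.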
