Splitting datasets of various sizes into separate dataframes - python

I am new to Python and pandas.
I have a .csv file exported from some measurement equipment that has given me all the measurements taken over the course of a day in a single .csv file.
I have already managed to produce a fairly tidy dataframe, but I cannot work out how to separate the measurements.
The dataframe is structured as follows:
+-------------------------+-------------------+--------------------+-----------+------+-------------+------+--------------+
| SetupTitle | measurement_type | nan | nan | nan | nan | nan | nan |
| MetaData | TestRecord | measurement number | nan | nan | nan | nan | nan |
| DataName | voltage 1 | voltage 2 | current 1 | ... | | | data name 8 |
| DataValues | data 1 | ... | | | | | data 8 |
| ... | | | | | | | |
| hundreds of data points | | | | | | | |
| ... | | | | | | | |
| SetupTitle | measurement type | nan | nan | nan | nan | nan | nan |
| etc... | | | | | | | |
+-------------------------+-------------------+--------------------+-----------+------+-------------+------+--------------+
I would like to split each measurement into individual dataframes by using the "SetupTitle" value as a start point, but I'm not sure how to iterate through the column or how to extract the rest of the columns from each.
I think once they are split up I will be able to remove the setup row and metadata row and use them to name the dataframe which will give me a nice dataset.

You can use cumsum to count the occurrences of a specific value and groupby to separate them:
s = df[name_of_column].eq('SetupTitle').cumsum()
The value of s will then increase every time there's a SetupTitle in your column, and you can access the blocks with:
# say we want to store them in a dict:
blocks = {}
for num_block, block in df.groupby(s):
    # do whatever you want with the group
    blocks[num_block] = block
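For reference, here is a runnable sketch of the whole split on a toy frame shaped like the file described in the question (the column names and values are made up for illustration):
import pandas as pd

# toy frame: two measurement blocks, each starting with a SetupTitle row
df = pd.DataFrame({
    "col0": ["SetupTitle", "MetaData", "DataName", "DataValues", "DataValues",
             "SetupTitle", "MetaData", "DataName", "DataValues"],
    "col1": ["sweep", "TestRecord", "voltage 1", 0.1, 0.2,
             "sweep", "TestRecord", "voltage 1", 0.3],
})

s = df["col0"].eq("SetupTitle").cumsum()

blocks = {}
for num_block, block in df.groupby(s):
    # drop the SetupTitle / MetaData / DataName header rows, keep the data rows
    blocks[num_block] = block.iloc[3:].reset_index(drop=True)

print(blocks[1])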

Related

Fuzzymatcher returns NaN for best_match_score

I'm observing odd behaviour while performing fuzzy_left_join from the fuzzymatcher library. When trying to join two dataframes, the left one with 5217 records and the right one with 8734, only 71 records come back with a best_match_score, which seems really odd. To get better results I even removed all the numbers and kept only alphabetical characters in the joining columns. In the merged table the id column from the right table is NaN, which is also a strange result.
left table - column for join "amazon_s3_name". First item - limonig
+------+---------+-------+-----------+------------------------------------+
| id | product | price | category | amazon_s3_name |
+------+---------+-------+-----------+------------------------------------+
| 1 | A | 1.49 | fruits | limonig |
| 8964 | B | 1.39 | beverages | studencajfuzelimonilimonetatrevaml |
| 9659 | C | 2.79 | beverages | studencajfuzelimonilimtreval |
+------+---------+-------+-----------+------------------------------------+
right table - column for join "amazon_s3_name" - last item - limoni
+------+----------------------------------------------------------------------------------------------------------------------------+--------------------------------------------+
| id | picture | amazon_s3_name |
+------+----------------------------------------------------------------------------------------------------------------------------+--------------------------------------------+
| 191 | https://s3.eu-central-1.amazonaws.com/groceries.pictures/images/AhmadCajLimonIDjindjifil20X2G.jpg | ahmadcajlimonidjindjifilxg |
| 192 | https://s3.eu-central-1.amazonaws.com/groceries.pictures/images/AhmadCajLimonIDjindjifil20X2G40g.jpg | ahmadcajlimonidjindjifilxgg |
| 204 | https://s3.eu-central-1.amazonaws.com/groceries.pictures/images/Ahmadcajlimonidjindjifil20x2g40g00051265.jpg | ahmadcajlimonidjindjifilxgg |
| 1608 | https://s3.eu-central-1.amazonaws.com/groceries.pictures/images/Cajstudenfuzetealimonilimonovatreva15lpet.jpg | cajstudenfuzetealimonilimonovatrevalpet |
| 4689 | https://s3.eu-central-1.amazonaws.com/groceries.pictures/images/Lesieursalatensosslimonimaslinovomaslo.jpg | lesieursalatensosslimonimaslinovomaslo |
| 4690 | https://s3.eu-central-1.amazonaws.com/groceries.pictures/images/Lesieursalatensosslimonimaslinovomaslo05l500ml01301150.jpg | lesieursalatensosslimonimaslinovomaslolml |
| 4723 | https://s3.eu-central-1.amazonaws.com/groceries.pictures/images/Limoni.jpg | limoni |
+------+----------------------------------------------------------------------------------------------------------------------------+--------------------------------------------+
merged table - as we can see, best_match_score is NaN
+----+------------------+-----------+------------+-------+----------+----------------------+------------+---------------------+-------------+----------------------+
| id | best_match_score | __id_left | __id_right | price | category | amazon_s3_name_left | image_left | amazon_s3_name_left | image_right | amazon_s3_name_right |
+----+------------------+-----------+------------+-------+----------+----------------------+------------+---------------------+-------------+----------------------+
| 0 | NaN | 0_left | None | 1.49 | Fruits | Limoni500g09700112 | NaN | limonig | NaN | NaN |
| 2 | NaN | 2_left | None | 1.69 | Bio | Morkovi1kgbr09700132 | NaN | morkovikgbr | NaN | NaN |
+----+------------------+-----------+------------+-------+----------+----------------------+------------+---------------------+-------------+----------------------+
You could give polyfuzz a try. Use the setup from its examples, for instance with TF-IDF or BERT, then run:
model = PolyFuzz(matchers).match(df1["amazon_s3_name"].tolist(), df2["amazon_s3_name"].to_list())
df1['To'] = model.get_matches()['To']
then merge:
df1.merge(df2, left_on='To', right_on='amazon_s3_name')
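Filling in the setup step, here is a minimal sketch based on polyfuzz's quick-start (the "TF-IDF" shorthand and the From/To/Similarity columns of get_matches() come from its documentation; df1 and df2 stand for the left and right tables above):
from polyfuzz import PolyFuzz

model = PolyFuzz("TF-IDF")
model.match(df1["amazon_s3_name"].tolist(), df2["amazon_s3_name"].tolist())

matches = model.get_matches()   # columns: From, To, Similarity
df1["To"] = matches["To"]
merged = df1.merge(df2, left_on="To", right_on="amazon_s3_name",
                   suffixes=("_left", "_right"))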

Create a dataframe for stock analysis using a datetimeindex timeseries data source

I have a datasource which gives me the following dataframe, pricehistory:
+---------------------+------------+------------+------------+------------+----------+------+
| time | close | high | low | open | volume | red |
+---------------------+------------+------------+------------+------------+----------+------+
| | | | | | | |
| 2020-01-02 10:14:00 | 321.336177 | 321.505186 | 321.286468 | 321.505186 | 311601.0 | True |
| 2020-01-02 11:16:00 | 321.430623 | 321.465419 | 321.395827 | 321.465419 | 42678.0 | True |
| 2020-01-02 11:17:00 | 321.425652 | 321.445536 | 321.375944 | 321.440565 | 39827.0 | True |
| 2020-01-02 11:33:00 | 321.137343 | 321.261614 | 321.137343 | 321.261614 | 102805.0 | True |
| 2020-01-02 12:11:00 | 321.256643 | 321.266585 | 321.241731 | 321.266585 | 25629.0 | True |
| 2020-01-02 12:12:00 | 321.246701 | 321.266585 | 321.231789 | 321.266585 | 40869.0 | True |
| 2020-01-02 13:26:00 | 321.226818 | 321.266585 | 321.226818 | 321.261614 | 44011.0 | True |
| 2020-01-03 10:18:00 | 320.839091 | 320.958392 | 320.828155 | 320.958392 | 103351.0 | True |
| 2020-01-03 10:49:00 | 320.988217 | 321.077692 | 320.988217 | 321.057809 | 84492.0 | True |
| etc... | etc... | etc... | etc... | etc... | etc... | etc. |
+---------------------+------------+------------+------------+------------+----------+------+
Output of pricehistory.dtypes:
close float64
high float64
low float64
open float64
volume float64
red bool
dtype: object
Output of pricehistory.index.dtype:
dtype('<M8[ns]')
Note: This dataframe is large; each row is 1 minute of data and it spans months, so there are many time frames to iterate over.
Question:
I have some specific criteria I'd like to use that will become columns in a new dataframe:
High price and time (minute) of each day for the entire dataframe
The first occurrence during each day of 4 downward-trending minutes (close < open), with their respective times
So far, I'm not exactly sure how to pull the time (datetimeindex value) and high price from pricehistory.
For (1) above, I'm using pd.DataFrame(pricehistory.high.groupby(pd.Grouper(freq='D')).max()) which gives me:
+------------+------------+
| time | high |
+------------+------------+
| | |
| 2020-01-02 | 322.956677 |
| 2020-01-03 | 321.753729 |
| 2020-01-04 | NaN |
| 2020-01-05 | NaN |
| 2020-01-06 | 321.843204 |
| etc... | etc... |
+------------+------------+
But this doesn't work because it only gives me the day and not the minute, and using min as the Grouper freq doesn't work because then it's just the max value of each minute, which is simply high.
Desired outcome (note: minutes included):
+---------------------+------------+
| time | high |
+---------------------+------------+
| | |
| 2020-01-02 9:31:00 | 322.956677 |
| 2020-01-03 10:13:11 | 321.753729 |
| 2020-01-04 15:33:12 | 320.991231 |
| 2020-01-06 12:01:23 | 321.843204 |
| etc... | etc... |
+---------------------+------------+
For (2) above, I'm using the following:
pricehistory['red'] = pricehistory['close'].lt(pricehistory['open'])
To make a new column in pricehistory which flags each red minute (close below open).
Then, using new_pricehistory = pricehistory.loc[pricehistory[::-1].rolling(4)['red'].sum().eq(4)], I get a new dataframe of only the rows where 4 red minutes in a row occur. Preferably I'd like to have only the very first occurrence, not all of them.
Current output:
+---------------------+------------+------------+------------+------------+--------+------+
| time | close | high | low | open | volume | red |
+---------------------+------------+------------+------------+------------+--------+------+
| | | | | | | |
| 2020-01-02 10:14:00 | 321.336177 | 321.505186 | 321.286468 | 321.505186 | 311601 | TRUE |
| 2020-01-03 10:18:00 | 320.839091 | 320.958392 | 320.828155 | 320.958392 | 103351 | TRUE |
| 2020-01-06 10:49:00 | 320.520956 | 320.570665 | 320.501073 | 320.550781 | 71901 | TRUE |
+---------------------+------------+------------+------------+------------+--------+------+
Since you didn't provide data, I created some dummy data. By SO policy you should ask a separate question per problem, but I'm answering both here.
Generate data
import pandas as pd
import numpy as np
times = pd.date_range(start="2020-06-01", end="2020-06-10", freq="1T")
df = pd.DataFrame({"time": times,
                   "high": np.random.randn(len(times))})
Question 1
Here I just look for the index where the daily maximum occurs and filter df accordingly:
idx = df.groupby(df["time"].dt.date)["high"].idxmax().values
df[df.index.isin(idx)]
Update: In case you have time as the index in your df, the solution will be:
df = df.set_index("time")
idx = df.groupby(pd.Grouper(freq='D'))["high"].idxmax().values
df[df.index.isin(idx)]
Question 2
import pandas as pd
import numpy as np

# generate data
times = pd.date_range(start="2020-06-01", end="2020-06-10", freq="1T")
df = pd.DataFrame({"time": times,
                   "open": np.random.randn(len(times))})
df["open"] = np.where(df["open"] < 0, -1 * df["open"], df["open"])
df["close"] = df["open"] + 0.01 * np.random.randn(len(times))
df = df.set_index("time")
df["red"] = df['close'].lt(df['open'])

# this function returns the first time
# there are 4 consecutive red minutes
def get_first(ts):
    idx = ts.loc[ts[::-1].rolling(4)['red'].sum().ge(4)].index
    if idx.empty:
        return pd.NaT
    else:
        return idx[0]

# get the first time within each daily group and drop NaT
grp = df.groupby(pd.Grouper(freq='D'))\
        .apply(get_first).dropna()
df[df.index.isin(grp.values)]
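Applied to the asker's frame, the same pattern would look like this (a sketch; pricehistory and its red column come from the question, get_first from the answer above):
first_runs = (pricehistory.groupby(pd.Grouper(freq='D'))
                          .apply(get_first)
                          .dropna())
new_pricehistory = pricehistory[pricehistory.index.isin(first_runs.values)]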

Is there any way to make a column for each category under each group?

There are two columns, both categorical. I want to group by the first column, age_group, and have my other column, Engagement_category, produce a new column for each of its values.
I did the groupby on the first column, age_group.
| age_group | Engagement_category |
|--------------|---------------------|
| 21-26 | Not Engaged |
| 27-32 | Disengaged |
| 33-38 | Engaged |
| 39-44 | Disengaged |
| 45-50 | Not Engaged |
| 50 and Above | Engaged |
group = df.groupby('age_group')
The required output is below:
| age_group | Engaged | Nearly Engaged | Not Engaged | Disengaged |
|-----------|---------|----------------|-------------|------------|
| 21-26 | 3 | 4 | 1 | 1 |
| 27-32 | 4 | 0 | 4 | 0 |
| 33-38 | 2 | 0 | 1 | 1 |
Thank you.
You want to group by the first two columns and then use the count aggregation across the dataframe. Try this:
df.groupby(['age_group', 'Engagement_category']).agg(['count'])
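To reshape those counts into the wide layout shown in the question (one column per Engagement_category value), an unstack after the groupby is one option. A minimal sketch, assuming the columns are named age_group and Engagement_category as above:
out = (df.groupby(['age_group', 'Engagement_category'])
         .size()
         .unstack(fill_value=0)
         .reset_index())
print(out)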

How to make a Pandas count pivot table with related children items?

I'm stuck on a little problem with Python and a pandas dataframe.
I want to make a pivot table that counts some related items.
I have a dataframe with this structure:
+-----+------------+-----------+-----------+
| ID | Item_Type | Basket_ID | OwnerName |
+-----+------------+-----------+-----------+
| 3 | Basket | | |
| 336 | ChickenEgg | 3 | Henk |
| 841 | SomeEgg | 3 | Henk |
| 671 | EasterEgg | 3 | Piet |
| 9 | Basket | | |
| 336 | Orange | 9 | Piet |
| 841 | Banana | 9 | Piet |
| 671 | Strawberry | 9 | Herman |
| 888 | Apple | 9 | Herman |
| 821 | Apricots | 9 | NaN |
+-----+------------+-----------+-----------+
I want to count how many items are related to each ‘Basket’ item (parent) and how often each ‘OwnerName’ appears with that ‘Basket’ item.
I want my dataframe to look like the one below.
You can see the total count of items that are related to the parent Item_Type ‘Basket’, and the total count of how often each name appears.
You can also see how many ‘Total_Owners’ there are, as well as the items without an owner.
+----+-----------+-------------------+------------+------------+--------------+--------------+------------------+
| ID | Item_Type | Total_Items_Count | Henk_Count | Piet_Count | Herman_Count | Total_Owners | Total_NaN_Values |
+----+-----------+-------------------+------------+------------+--------------+--------------+------------------+
| 3 | Basket | 3 | 2 | 1 | 0 | 3 | |
| 9 | Basket | 5 | 0 | 2 | 2 | 4 | 1 |
+----+-----------+-------------------+------------+------------+--------------+--------------+------------------+
Answering your question requires multiple steps, but the core idea is that you should use pivot_table.
The table conceptually has a multilevel index: Basket_ID is the high-level index and ID is the more granular index. The first thing you have to do is remove the rows where Basket_ID is missing so that the granularity of the table is consistent.
Let's say you named your dataframe df.
# Preparation steps
df = df[~df["Basket_ID"].isna()]  # remove the rows that shouldn't be counted
df.loc[df["OwnerName"].isna(), "OwnerName"] = "unknown"  # set missing owners to an arbitrary value

# Make a pivot table
df = df.pivot_table(index=['Basket_ID'], columns=['OwnerName'],
                    values=['Item_Type'], aggfunc='count').fillna(0)
From there onwards you should be able to calculate your remaining columns.
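For example, a minimal sketch of those remaining columns, continuing from the pivot table above (the output column names are taken from the question, and "unknown" is the placeholder set in the preparation step):
# flatten the ('Item_Type', owner) column MultiIndex down to the owner names
df.columns = df.columns.get_level_values('OwnerName')

df['Total_Items_Count'] = df.sum(axis=1)            # all items per basket
df['Total_NaN_Values'] = df['unknown'] if 'unknown' in df.columns else 0
owner_cols = [c for c in df.columns
              if c not in ('Total_Items_Count', 'Total_NaN_Values', 'unknown')]
df['Total_Owners'] = df[owner_cols].sum(axis=1)     # items that have an owner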

How to Group Data with Matching Values

I am new to using Pandas and I am trying to restructure a dataframe to remove the duplicates in my first column, while also keeping a count of each duplicate and taking the sum of the values in the second column.
For example, I would like the conversion to look something like this:
[In]:
+---+------+-------+
| | Name | Value |
+---+------+-------+
| 0 | A | 5 |
| 1 | B | 5 |
| 2 | C | 10 |
| 3 | A | 15 |
| 4 | A | 5 |
| 5 | C | 10 |
+---+------+-------+
[Out]:
+---+------+--------+-------+
| | Name | Number | Total |
+---+------+--------+-------+
| 0 | A | 3 | 25 |
| 1 | B | 1 | 5 |
| 2 | C | 2 | 20 |
+---+------+--------+-------+
So far, I haven't been able to find an efficient method to do this. (Or even a working method.)
I will be working with several hundred thousand rows, so I will need to find a pretty efficient method.
The pandas agg function on a groupby is what you want.
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.agg.html
Here is an example:
import pandas as pd

df = pd.DataFrame({'Name': ['A', 'B', 'C', 'A', 'A', 'C'],
                   'Value': [5, 5, 10, 15, 5, 10]})
df.groupby('Name').agg(['count', 'sum'])
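If you want the exact Number/Total column names from the desired output, named aggregation is one option (a sketch; the names come from the question):
out = (df.groupby('Name', as_index=False)
         .agg(Number=('Value', 'count'), Total=('Value', 'sum')))
print(out)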
Hope that helps.
