How to identify zones in a table using pandas? - python

I have a file with a table (.csv file).
The table is composed by many sub "areas" like this example:
As you can see, there are some rows which can be grouped together (blue group, orange group, etc.).
The color is just to make the concept clear: in the .csv there is nothing that identifies the groups, and the group sizes (number of rows) can change. There is no pattern to predict whether the next group has 1, 2, 3, 4 or more rows.
The problem is that I need to open the table and import it into a pandas dataframe. In my algorithm one group should be identified, copied to another dataframe and then saved.
How can I group data using pandas?
I was thinking of indexing the groups like in the following table:
but in this case I cannot access the cells with the same index sequentially.
Any idea?
EDIT: here is the table from the .csv file:
,X,Y,Z,mm,ff,cc
1,1,2,3,0.2,0.4,0.3
,,,,0.1,0.3,0.4
2,1,2,3,0.1,1.2,-1.2
,,,,0.12,-1.234,303.4
,,,,1.2,43.2,44.3
,,,,7.4,88.3,34.4
3,2,4,2,1.13,4.1,55.1
,,,,80.3,34.1,4.01
,,,,43.12,12.3,98.4

You can create an index and insert it into the first position, per your desired output. I have also used ffill() to get rid of the nulls, but that is optional for you.
# without ffill()
df.insert(0, 'index', (df[['X', 'Y', 'Z']].notnull().sum(axis=1) == 3).cumsum())
# df = df.ffill() # uncomment if you want ffill()
df
Out[1]:
index X Y Z mm ff cc
0 1 1.0 2.0 3.0 0.20 0.400 0.30
1 1 NaN NaN NaN 0.10 0.300 0.40
2 2 1.0 2.0 3.0 0.10 1.200 -1.20
3 2 NaN NaN NaN 0.12 -1.234 303.40
4 2 NaN NaN NaN 1.20 43.200 44.30
5 3 2.0 4.0 2.0 1.13 4.100 55.10
6 3 NaN NaN NaN 80.30 34.100 4.01
# with ffill
df = df.ffill()
df
Out[2]:
index X Y Z mm ff cc
0 1 1.0 2.0 3.0 0.20 0.400 0.30
1 1 1.0 2.0 3.0 0.10 0.300 0.40
2 2 1.0 2.0 3.0 0.10 1.200 -1.20
3 2 1.0 2.0 3.0 0.12 -1.234 303.40
4 2 1.0 2.0 3.0 1.20 43.200 44.30
5 3 2.0 4.0 2.0 1.13 4.100 55.10
6 3 2.0 4.0 2.0 80.30 34.100 4.01
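With the helper index in place, a single group can then be copied into its own dataframe and saved, which is what the question asks for. A minimal sketch (the output file name is made up):
group_2 = df[df['index'] == 2].copy()    # copy the rows of group 2 into a new dataframe
group_2.to_csv('group_2.csv', index=False)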

Try groupby:
groups = df[['X','Y','Z']].notna().all(axis=1).cumsum()
for k, d in df.groupby(groups):
    # do something with the groups
    print(f'Group {k}')
    print(d)
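Putting the pieces together, a minimal end-to-end sketch (assuming the file is called data.csv and the unnamed first column serves as the row index):
import pandas as pd

df = pd.read_csv('data.csv', index_col=0)                  # hypothetical file name
groups = df[['X', 'Y', 'Z']].notna().all(axis=1).cumsum()  # a new group starts where X/Y/Z are filled
for k, d in df.groupby(groups):
    d.to_csv(f'group_{k}.csv')                             # save each group to its own file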

Related

How to use np.where for creating multiple conditional columns?

I have a dataframe as follows:
df = pd.DataFrame({
'ID': [1, 2, 3, 4],
'Group1': ['Maintenance', 'Shop', 'Admin', 'Shop'],
'Hours1': [4, 4, 8, 8],
'Group2': ['Admin', 'Customer', '0', '0'],
'Hours2': [4.0, 2.0, 0.0, 0.0],
'Group3': ['0', 'Admin', '0', '0'],
'Hours3': [0.0, 2.0, 0.0, 0.0],
})
>>> df
ID Group1 Hours1 Group2 Hours2 Group3 Hours3
0 1 Maintenance 4 Admin 4.0 0 0.0
1 2 Shop 4 Customer 2.0 Admin 2.0
2 3 Admin 8 0 0.0 0 0.0
3 4 Shop 8 0 0.0 0 0.0
I would like to create new columns as follows:
desired output:
This is my code and the current output. I understand why it is not giving me what I want, but I'm not sure how to modify my code to get the desired output.
Code:
segment_list = ["Maintenance", "Shop", "Admin", "Customer"]
for i in segment_list:
    df["Seg_"+i] = np.where((df["Group1"] == i) | (df["Group2"] == i) | (df["Group3"] == i),
                            (df["Hours1"] + df["Hours2"] + df["Hours3"]) / 8, 0)
Current output
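One way to adjust that loop so each segment only counts its own hours (a sketch of my own, assuming the desired output is each segment's share of the row's total hours, and reusing segment_list from above):
total = df[['Hours1', 'Hours2', 'Hours3']].sum(axis=1)
for seg in segment_list:
    # keep only the hours whose matching Group column equals this segment
    hours = sum(df[f'Hours{j}'].where(df[f'Group{j}'] == seg, 0) for j in (1, 2, 3))
    df['Seg_' + seg] = hours / total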
Probably not the cleanest way, but it works and I couldn't come up with a more elegant approach.
print(df)
# ID Group1 Hours1 Group2 Hours2 Group3 Hours3
# 0 1 Maintenance 4 Admin 4.0 NaN NaN
# 1 2 Shop 4 Customer 2.0 Admin 2.0
# 2 3 Admin 8 NaN NaN NaN NaN
# 3 4 Shop 8 NaN NaN NaN NaN
df1 = df.melt(id_vars=['ID'], value_vars=['Group1', 'Group2', 'Group3'], value_name='Group')
df2 = df.melt(id_vars=['ID'], value_vars=['Hours1', 'Hours2', 'Hours3'], value_name='Hours')
# We need the Hours column only, so just add it to df1
df1['Hours'] = df2['Hours']
# A lot of ID's will have NaN values for empty groups, so let's remove them.
df1 = df1.sort_values('ID').dropna()
# Now we pivot, where the Groups become the columns.
pvt = df1.pivot(index='ID', columns='Group', values='Hours')
# Calculate the percentage share of each group within a row.
pvt = pvt.apply(lambda r: r/r.sum() , axis=1).reset_index()
# Merge the pivot table with the original df on ID.
result = pd.merge(df, pvt, how='inner', on='ID')
print(result)
# ID Group1 Hours1 Group2 Hours2 Group3 Hours3 Admin Customer \
# 0 1 Maintenance 4 Admin 4.0 NaN NaN 0.50 NaN
# 1 2 Shop 4 Customer 2.0 Admin 2.0 0.25 0.25
# 2 3 Admin 8 NaN NaN NaN NaN 1.00 NaN
# 3 4 Shop 8 NaN NaN NaN NaN NaN NaN
# Maintenance Shop
# 0 0.5 NaN
# 1 NaN 0.5
# 2 NaN NaN
# 3 NaN 1.0
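If the new columns should carry the Seg_ prefix used in the question's code (my assumption), the pivoted columns can be renamed before the merge, for example:
pvt = pvt.rename(columns=lambda c: c if c == 'ID' else f'Seg_{c}')
result = pd.merge(df, pvt, how='inner', on='ID')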
Here is how I would approach this in a fairly generic way. For a problem like this, I find pandas easier to use than plain np.where (because of groupby and its handling of index and multi-index):
First, some cleaning and reshaping:
import re
import numpy as np

# set ID as index and clean up the '0' entries,
# which really should be NaN (missing data):
df2 = df.set_index('ID').replace({0: np.nan, '0': np.nan})
# then, convert 'Group1', ... into a MultiIndex [(Group, 1), (Hours, 1), ...]
ix = pd.MultiIndex.from_tuples([
    re.match(r'(.*?)(\d+)', k).groups() for k in df2.columns])
# and convert to a long frame with ['ID', 'Group'] as index
z = df2.set_axis(ix, axis=1).stack(level=1).droplevel(1).set_index(
    'Group', append=True)
>>> z
Hours
ID Group
1 Maintenance 4.0
Admin 4.0
2 Shop 4.0
Customer 2.0
Admin 2.0
3 Admin 8.0
4 Shop 8.0
Now, calculate the desired summaries (here, just one: fraction of hours relative to ID's total):
# add some summary stats (fraction of total)
z = z.assign(Seg=z.groupby('ID')['Hours'].transform(lambda g: g / g.sum()))
>>> z
Hours Seg
ID Group
1 Maintenance 4.0 0.50
Admin 4.0 0.50
2 Shop 4.0 0.50
Customer 2.0 0.25
Admin 2.0 0.25
3 Admin 8.0 1.00
4 Shop 8.0 1.00
At this point, one could reshape to wide again, with MultiIndex columns:
>>> z.unstack('Group')
Hours Seg
Group Admin Customer Maintenance Shop Admin Customer Maintenance Shop
ID
1 4.0 NaN 4.0 NaN 0.50 NaN 0.5 NaN
2 2.0 2.0 NaN 4.0 0.25 0.25 NaN 0.5
3 8.0 NaN NaN NaN 1.00 NaN NaN NaN
4 NaN NaN NaN 8.0 NaN NaN NaN 1.0
Or, closer to the original intention, we can concat horizontally just the Seg portion to the (cleaned up) original:
df2 = pd.concat([
df2,
z['Seg'].unstack('Group').rename(columns=lambda s: f'Seg_{s}'),
], axis=1)
>>> df2
Group1 Hours1 Group2 Hours2 Group3 Hours3 Seg_Admin Seg_Customer Seg_Maintenance Seg_Shop
ID
1 Maintenance 4 Admin 4.0 NaN NaN 0.50 NaN 0.5 NaN
2 Shop 4 Customer 2.0 Admin 2.0 0.25 0.25 NaN 0.5
3 Admin 8 NaN NaN NaN NaN 1.00 NaN NaN NaN
4 Shop 8 NaN NaN NaN NaN NaN NaN NaN 1.0
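If the Seg_ columns should show 0 instead of NaN for segments an ID has no hours in (matching the np.where(..., 0) default in the question), they can be filled afterwards, e.g.:
seg_cols = [c for c in df2.columns if c.startswith('Seg_')]
df2[seg_cols] = df2[seg_cols].fillna(0)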

How to find outliers and invalid count for each row in a pandas dataframe

I have a pandas dataframe that looks like this:
X Y Z
0 9.5 -2.3 4.13
1 17.5 3.3 0.22
2 NaN NaN -5.67
...
I want to add 2 more columns: Is Invalid and Is Outlier.
Is Invalid will just keep track of the invalid/NaN values in a given row. So for the row with index 2, Is Invalid will have a value of 2. For rows with only valid entries, Is Invalid will display 0.
Is Outlier will just check whether a given row has outlier data. This will just be True/False.
At the moment, this is my code:
dt = np.fromfile(path, dtype='float')
df = pd.DataFrame(dt.reshape(-1, 3), columns=['X', 'Y', 'Z'])
How can I go about adding these features?
x='''Z,Y,X,W,V,U,T
1,2,3,4,5,6,60
17.5,3.3,.22,22.11,-19,44,0
,,-5.67,,,,
'''
import pandas as pd, io, scipy.stats
df = pd.read_csv(io.StringIO(x))
df
Sample input:
Z Y X W V U T
0 1.0 2.0 3.00 4.00 5.0 6.0 60.0
1 17.5 3.3 0.22 22.11 -19.0 44.0 0.0
2 NaN NaN -5.67 NaN NaN NaN NaN
Transformations:
df['is_invalid'] = df.isna().sum(axis=1)
df['is_outlier'] = df.iloc[:, :-1].apply(
    lambda r: (r < (r.quantile(0.25) - 1.5 * scipy.stats.iqr(r)))
            | (r > (r.quantile(0.75) + 1.5 * scipy.stats.iqr(r))),
    axis=1).sum(axis=1)
df
Final output:
Z Y X W V U T is_invalid is_outlier
0 1.0 2.0 3.00 4.00 5.0 6.0 60.0 0 1
1 17.5 3.3 0.22 22.11 -19.0 44.0 0.0 0 0
2 NaN NaN -5.67 NaN NaN NaN NaN 6 0
Explanation for the outlier count:
The valid range is from Q1 - 1.5*IQR to Q3 + 1.5*IQR.
Since it needs to be calculated per row, we use apply and pass each row (r). To count outliers, we flip the range, i.e. anything less than Q1 - 1.5*IQR or greater than Q3 + 1.5*IQR is counted.
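If a plain True/False flag per row is preferred, as the question describes, rather than a count, a small variant along these lines should work (a sketch reusing the sample df above):
def has_outlier(r):
    # flag the row if any value falls outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
    q1, q3 = r.quantile(0.25), r.quantile(0.75)
    iqr = q3 - q1
    return bool(((r < q1 - 1.5 * iqr) | (r > q3 + 1.5 * iqr)).any())

data_cols = ['Z', 'Y', 'X', 'W', 'V', 'U', 'T']
df['is_outlier'] = df[data_cols].apply(has_outlier, axis=1)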

pandas: How to merge multiple dataframes with same column names on one column, with duplicate values?

Following this question,
I have two data sets acquired simultaneously by different acquisition systems with different sampling rates. One is very regular, and the other is not. I would like to create a single dataframe containing both data sets, using the regularly spaced timestamps (in seconds) as the reference for both. The irregularly sampled data should be interpolated on the regularly spaced timestamps.
I have the exact same situation, but my t column may have duplicates.
For each row whose t is duplicated, I would like to keep only the one whose data column is maximal.
Following the original example:
df1:
t y1
0 0.0 0.0
1 0.5 0.5
2 1.0 1.0
3 1.0 3.0
4 1.5 1.5
5 2.0 2.0
df2:
t y2
0 0.00 0.00
1 0.34 1.02
2 1.01 3.03
3 1.40 4.20
4 1.60 4.80
5 1.70 5.10
6 2.01 6.03
df_combined:
t y1 y2
0 0.0 0.0 0.0
1 0.5 0.5 1.5
2 1.0 3.0 3.0
3 1.5 1.5 4.5
4 2.0 2.0 6.0
(notice that at t=1.0, y1 is now 3.0)
How do I do this?
There are three tasks:
drop duplicates on df1,
interpolate df2,
merge the two.
So here's a solution:
import numpy as np

# new_idx was defined in the linked question; using the union of both
# time grids (an assumption on my part) reproduces the output below
new_idx = np.union1d(df1['t'], df2['t'])

(df2.set_index('t')
    .reindex(new_idx)
    .interpolate('index')
    .reset_index()
    .merge(df1.sort_values('y1', ascending=False)
              .drop_duplicates('t'),
           on='t', how='right')
)
Output:
t y2 y1
0 0.0 0.0 0.0
1 0.5 1.5 0.5
2 1.0 3.0 3.0
3 1.5 4.5 1.5
4 2.0 6.0 2.0
If you are dealing with actual timestamps, you should look at the datetime functionality (e.g. pd.to_datetime), which is often overlooked and is also important for time series forecasting.
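To make that concrete, here is a tiny illustration of time-aware interpolation once string timestamps are parsed with pd.to_datetime (the dates below are made up):
import pandas as pd

s = pd.Series([1.0, None, 3.0],
              index=pd.to_datetime(['2021-01-01 00:00:00',
                                    '2021-01-01 00:00:30',
                                    '2021-01-01 00:01:00']))
print(s.interpolate(method='time'))   # the middle value becomes 2.0 (midpoint in time)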

new Pandas Dataframe column calculated from other column values

How can I create a new column in a dataframe that consists of the MEAN of an indexed range of values in that row?
example:
1 2 3 JUNK
0 0.0 0.0 0.0 A
1 1.0 1.0 -1.0 B
2 2.0 2.0 1.0 C
the JUNK column would be ignored when trying to determine the MEAN column
expected output:
1 2 3 JUNK MEAN
0 0.0 0.0 0.0 A 0.0
1 1.0 1.0 -1.0 B 0.33
2 2.0 2.0 1.0 C 1.66
Use drop to remove, or iloc to filter out, the unnecessary columns:
df['MEAN'] = df.drop('JUNK', axis=1).mean(axis=1)
df['MEAN'] = df.iloc[:, :-1].mean(axis=1)
print (df)
1 2 3 JUNK MEAN
0 0.0 0.0 0.0 A 0.000000
1 1.0 1.0 -1.0 B 0.333333
2 2.0 2.0 1.0 C 1.666667
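If there are several non-numeric columns rather than one known JUNK column, selecting by dtype is another option (my own addition, not part of the original answer):
df['MEAN'] = df.select_dtypes('number').mean(axis=1)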

Pandas dataframe fillna() only some columns in place

I am trying to fill none values in a Pandas dataframe with 0's for only some subset of columns.
When I do:
import pandas as pd
df = pd.DataFrame(data={'a':[1,2,3,None],'b':[4,5,None,6],'c':[None,None,7,8]})
print(df)
df.fillna(value=0, inplace=True)
print(df)
The output:
a b c
0 1.0 4.0 NaN
1 2.0 5.0 NaN
2 3.0 NaN 7.0
3 NaN 6.0 8.0
a b c
0 1.0 4.0 0.0
1 2.0 5.0 0.0
2 3.0 0.0 7.0
3 0.0 6.0 8.0
It replaces every None with 0's. What I want to do is, only replace Nones in columns a and b, but not c.
What is the best way of doing this?
You can select your desired columns and do it by assignment:
df[['a', 'b']] = df[['a','b']].fillna(value=0)
The resulting output is as expected:
a b c
0 1.0 4.0 NaN
1 2.0 5.0 NaN
2 3.0 0.0 7.0
3 0.0 6.0 8.0
You can use a dict to fillna, with a different value for each column:
df.fillna({'a':0,'b':0})
Out[829]:
a b c
0 1.0 4.0 NaN
1 2.0 5.0 NaN
2 3.0 0.0 7.0
3 0.0 6.0 8.0
Then assign it back:
df=df.fillna({'a':0,'b':0})
df
Out[831]:
a b c
0 1.0 4.0 NaN
1 2.0 5.0 NaN
2 3.0 0.0 7.0
3 0.0 6.0 8.0
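As a side note, the dict form also allows a genuinely different fill value per column, e.g. (my own illustration):
df.fillna({'a': 0, 'b': -1})   # fill 'a' with 0 and 'b' with -1; 'c' is left untouched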
You can avoid making a copy of the object using Wen's solution and inplace=True:
df.fillna({'a':0, 'b':0}, inplace=True)
print(df)
Which yields:
a b c
0 1.0 4.0 NaN
1 2.0 5.0 NaN
2 3.0 0.0 7.0
3 0.0 6.0 8.0
Using the top answer produces a warning about making changes to a copy of a df slice. Assuming that you have other columns, a better way to do this is to pass a dictionary:
df.fillna({'A': 'NA', 'B': 'NA'}, inplace=True)
This should work, and without the copy warning:
df[['a', 'b']] = df.loc[:,['a', 'b']].fillna(value=0)
Here's how you can do it all in one line:
df[['a', 'b']].fillna(value=0, inplace=True)
Breakdown: df[['a', 'b']] selects the columns you want to fill NaN values for, value=0 tells it to fill NaNs with zero, and inplace=True is meant to make the changes permanent without making a copy of the object. Note, however, that df[['a', 'b']] returns a copy, so depending on your pandas version this may not actually modify the original df (see the answers below).
Or something like:
df.loc[df['a'].isnull(), 'a'] = 0
df.loc[df['b'].isnull(), 'b'] = 0
and if there are more:
for i in your_list:
    df.loc[df[i].isnull(), i] = 0
For some odd reason this DID NOT work (using pandas 0.25.1):
df[['col1', 'col2']].fillna(value=0, inplace=True)
Another solution:
subset_cols = ['col1','col2']
[df[col].fillna(0, inplace=True) for col in subset_cols]
Example:
import numpy as np
df = pd.DataFrame(data={'col1': [1, 2, np.nan], 'col2': [1, np.nan, 3], 'col3': [np.nan, 2, 3]})
output:
col1 col2 col3
0 1.00 1.00 nan
1 2.00 nan 2.00
2 nan 3.00 3.00
Apply list comp. to fillna values:
subset_cols = ['col1','col2']
[df[col].fillna(0, inplace=True) for col in subset_cols]
Output:
col1 col2 col3
0 1.00 1.00 nan
1 2.00 0.00 2.00
2 0.00 3.00 3.00
Sometimes this syntax won't work:
df[['col1','col2']] = df[['col1','col2']].fillna(0)
In that case, try selecting the columns with .loc instead:
df.loc[:, ['col1','col2']] = df.loc[:, ['col1','col2']].fillna(0)
