I am trying to handle the following data issue. I have a dataframe of values and their label lists (this is multi-class, so the labels are a list).
The dataframe looks like:
      | value | labels
---------------------------------
row_1 | A     | [label1]
row_2 | B     | [label2]
row_3 | C     | [label3, label4]
row_4 | D     | [label4, label5]
I want to find all rows that have a specific label and then:
Firstly, concatenate each such row's value with the next row's value; the string goes before the next row's value.
Secondly, append its labels to the beginning of the next row's label list.
For example, if I want to do that for label2, the desired output will be:
      | value | labels
---------------------------------
row_1 | A     | [label1]
row_3 | BC    | [label2, label3, label4]
row_4 | D     | [label4, label5]
The value "B" is joined before the next row's value, and the label "label2" is appended to the beginning of the next row's label list. The indexes are not relevant to me.
I would greatly appreciate help with this. I tried to use merge, join, shift, and cumsum, but without success so far.
The following code creates the data in the example:
import pandas as pd

data = {'row_1': ["A", ["label1"]], 'row_2': ["B", ["label2"]],
        'row_3': ["C", ["label3", "label4"]], 'row_4': ["D", ["label4", "label5"]]}
df = pd.DataFrame.from_dict(data, orient='index').rename(columns={0: "value", 1: "labels"})
You could create a grouping variable and use that to aggregate the columns:
import pandas as pd
import numpy as np
def my_combine(data, value):
    # Flag rows whose label list contains `value`.
    index = data['labels'].apply(lambda x: np.isin(value, x))
    if all(~index):
        return data
    # A flagged row belongs to the same group as the row after it.
    idx = (index | index.shift(fill_value=False)).to_numpy()
    vals = (np.arange(idx.size) + 1) * (~idx)
    gr = np.r_[np.where(vals[1:] != vals[:-1])[0], vals.size - 1]
    groups = np.repeat(gr, np.diff(np.r_[-1, gr]))
    return data.groupby(groups).agg(sum)
my_combine(df, 'label2')
value labels
0 A [label1]
2 BC [label2, label3, label4]
3 D [label4, label5]
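For comparison, a plain loop over the rows can express the same merge more directly. This is only a sketch (merge_label_rows is a hypothetical name, and it rebuilds the frame, dropping the original index, which the question says is irrelevant): each row containing the target label is carried forward and prepended to the row below it.

def merge_label_rows(df, label):
    out = []
    carry_value, carry_labels = "", []
    for _, row in df.iterrows():
        # Prepend whatever was carried over from flagged rows above.
        value = carry_value + row['value']
        labels = carry_labels + list(row['labels'])
        if label in row['labels']:
            # This row is flagged: carry the merged result into the next row.
            carry_value, carry_labels = value, labels
        else:
            out.append({'value': value, 'labels': labels})
            carry_value, carry_labels = "", []
    if carry_value:
        # A flagged final row has no row below it; keep it as-is.
        out.append({'value': carry_value, 'labels': carry_labels})
    return pd.DataFrame(out)

merge_label_rows(df, 'label2')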
How to count between A and B in the same column of a dataframe (pandas)?
I want to count "CUT" in the m/c code column between "STD" and "STD", which are repeated many times in the column.
See the image attached below.
Another solution to your problem would be to create an auxiliary column that labels each interval. Then you can apply a groupby alongside the transform method to perform the counting. Here's the code:
from __future__ import annotations
import pandas as pd
import numpy as np
# == Helper functions (Not part of the actual solution) =======================
# You can ignore these functions; they aren't actually part of the
# solution, but rather a way to generate some data to test the implementation.
def random_dates(
    start_date: str | pd.Timestamp,
    end_date: str | pd.Timestamp,
    size: int = 10,
) -> pd.DatetimeIndex:
    """Generate random dates between two dates.

    Parameters
    ----------
    start_date : str | pd.Timestamp
        Start date.
    end_date : str | pd.Timestamp
        End date.
    size : int, optional
        Number of dates to generate, by default 10.

    Returns
    -------
    pd.DatetimeIndex
        Random dates.

    Examples
    --------
    >>> random_dates("2020-01-01", "2020-01-31", size=5)  # doctest: +ELLIPSIS
    DatetimeIndex(['2020-01-05', '2020-01-12', ...], dtype='datetime64[ns]', freq=None)
    """
    start_u = pd.to_datetime(start_date).value // 10**9
    end_u = pd.to_datetime(end_date).value // 10**9
    return pd.to_datetime(np.random.randint(start_u, end_u, size), unit="s")
def generate_random_frame(
    start_date: str | pd.Timestamp,
    end_date: str | pd.Timestamp,
    size: int = 10,
) -> pd.DataFrame:
    """
    Generate a DataFrame to test the solution.

    Parameters
    ----------
    start_date : str | pd.Timestamp
        Start date. Must be a string representing a date, like "YYYY-MM-DD",
        or "YYYY-MM-DD HH:MM:SS". Optionally, can also be a pandas Timestamp
        object.
    end_date : str | pd.Timestamp
        End date. Must be a string representing a date, like "YYYY-MM-DD",
        or "YYYY-MM-DD HH:MM:SS". Optionally, can also be a pandas Timestamp.
    size : int, default 10
        Number of rows to generate.

    Returns
    -------
    pd.DataFrame
        DataFrame with random dates and random values. The resulting DataFrame
        has the following columns:
        - "Time": random datetimes between `start_date` and `end_date`.
        - "m/c code": random strings from a set of 7 possible values:
          "END", "CUT", "STD", "BL1", "ALS", "ST1", or "CLN".
    """
    mc_code_choices = ["END", "CUT", "STD", "BL1", "ALS", "ST1", "CLN"]
    return pd.DataFrame(
        {
            "Time": random_dates(start_date, end_date, size),
            "m/c code": np.random.choice(mc_code_choices, size),
        }
    )
# == Solution ==================================================================
def flag_groups_and_count(
    df: pd.DataFrame,
    group_colname: str = "m/c code",
    lowbound_value: str = "END",
    upbound_value: str = "STD",
    value_to_count: str = "CUT",
    count_colname: str = "Count",
    flag_colname: str = "FLAG",
) -> pd.DataFrame:
    """
    Flag groups and count the number of times a specified value appears in each group.

    Groups are defined by values between `lowbound_value` and `upbound_value`
    in the column `group_colname`. The flag is set to 1 for the first group.
    Subsequent groups are flagged as 2, 3, etc. A flag of 0 represents the
    absence of a group.

    After flagging the groups, the function counts the number of times
    the value specified by the `value_to_count` parameter appears in each group.

    Parameters
    ----------
    df : pd.DataFrame
        DataFrame to flag.
    group_colname : str, default "m/c code"
        Column name to group by.
    lowbound_value : str, default "END"
        Value that starts each group.
    upbound_value : str, default "STD"
        Value that ends each group.
    value_to_count : str, default "CUT"
        Value to count inside each group.
    count_colname : str, default "Count"
        Name of the column to store the counts.
    flag_colname : str, default "FLAG"
        Name of the column to store each group.

    Returns
    -------
    pd.DataFrame
        Original DataFrame with the added flag column.
    """
    # Set the initial parameters, used to control the creation of the groups.
    current_group = 1  # The current group number.
    flag_row = False   # Indicates whether the current row should be flagged.
    # Create the column that stores the group numbers.
    # Set all values initially to 0.
    df[flag_colname] = 0
    # Iterate over each row of the dataframe.
    # - index: index of each row. Same values you find by calling df.index
    # - row: a pandas Series object with the values of each row.
    for index, row in df.iterrows():
        # If the current row has a 'm/c code' value equal to 'END',
        # then set the flag_row variable to True to indicate that
        # the next rows should be set to `current_group` until
        # it finds a row with a 'm/c code' value equal to 'STD'.
        if row[group_colname] == lowbound_value:
            flag_row = True
        # Does this row belong to a group? If so, set it to `current_group`.
        if flag_row:
            df.loc[df.index.isin([index]), flag_colname] = current_group
        # If the current row has a 'm/c code' value equal to 'STD',
        # then we reached the end of a group. Set the flag_row variable
        # to False, indicating that the next rows should not be flagged
        # as part of a group.
        if row[group_colname] == upbound_value:
            # Did we reach the end of a group, or simply find another value
            # equal to "STD" before the next interval starts?
            # This avoids incrementing the group number when we didn't
            # actually reach a new interval.
            if flag_row:
                current_group += 1
            flag_row = False
    # Group the 'm/c code' column values by the newly created flag column.
    # Inside this groupby, use the `transform` method to count the number of
    # times the value "CUT" appears inside each group.
    # Store the count in a new column called "Count".
    df[count_colname] = df.groupby(flag_colname, as_index=False)[
        group_colname
    ].transform(lambda group: (group == value_to_count).sum())
    # Same as:
    # df["Count"] = df.groupby("FLAG", as_index=False)[
    #     "m/c code"
    # ].transform(lambda group: (group == "CUT").sum())
    # When the flag column is equal to 0, it means that there's no interval.
    # Therefore, set such counts to 0. Intervals represent the rows with
    # values for the 'm/c code' column between adjacent "END" and "STD" values.
    df.loc[df[flag_colname] == 0, count_colname] = 0
    # Same as: df.loc[df["FLAG"] == 0, "Count"] = 0
    return df
# == Test our code =============================================================
# Parameters to use for generating test DataFrame:
start_date = "2021-07-01 00:00:00"
end_date = "2021-07-03 00:00:00"
# Generate test DataFrame
test_df = generate_random_frame(start_date, end_date, size=30)
# Call the function that defines the implementation to the problem.
test_df = test_df.pipe(flag_groups_and_count)
test_df
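For reference, the flagging and counting can also be done mostly without the loop. The sketch below assumes the codes open and close cleanly (each "END" is eventually followed by an "STD", with no stray "STD" before the first "END"); data that violates that assumption needs the explicit loop above.

code = test_df["m/c code"]
opened = code.eq("END").cumsum()                          # openers seen so far
closed = code.eq("STD").shift(fill_value=False).cumsum()  # closers seen strictly before this row
flag = opened.where(opened > closed, 0)                   # 0 = outside any interval
test_df["Count"] = code.eq("CUT").groupby(flag).transform("sum").where(flag > 0, 0)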
There are obviously many different ways.
One solution is this:
import pandas as pd
def c_counter(df, A, B, C):
    # Positions where each interval starts (value A) and ends (value B).
    starts = [i for i, x in enumerate(df['m/c code']) if x == A]
    ends = [i for i, x in enumerate(df['m/c code']) if x == B]
    df['Count'] = ''
    for start, end in zip(starts, ends):
        # Count occurrences of C inside the interval and write it to those rows.
        df.iloc[start:end, df.columns.get_loc('Count')] = sum(df['m/c code'][start:end] == C)
    return df
df = pd.DataFrame({'m/c code': ['A', 'X', 'C', 'X', 'C', 'X', 'B', 'X', 'A', 'C', 'X', 'B']})
A = 'A'
B = 'B'
C = 'C'
c_counter(df, A, B, C)
Out:
m/c code Count
0 A 2
1 X 2
2 C 2
3 X 2
4 C 2
5 X 2
6 B
7 X
8 A 1
9 C 1
10 X 1
11 B
Next time, please make sure to include sample code.
I have this dataframe:
86,1/28/2004 0:00:00,16.9
86,5/25/2004 0:00:00,17.01
86,7/22/2004 0:00:00,17.06
87,11/15/2004 0:00:00,7.39
87,3/14/2005 0:00:00,7.59
86,11/15/2004 0:00:00,17.29
86,3/14/2005 0:00:00,17.38
86,4/19/2005 0:00:00,17.43
86,5/19/2005 0:00:00,17.28
87,1/22/2004 0:00:00,7.44
87,5/13/2004 0:00:00,7.36
I would like to work with two separate dataframes according to the value (id) in the first column. Ideally, I would like to have:
87,11/15/2004 0:00:00,7.39
87,3/14/2005 0:00:00,7.59
87,1/22/2004 0:00:00,7.44
87,5/13/2004 0:00:00,7.36
and
86,1/28/2004 0:00:00,16.9
86,5/25/2004 0:00:00,17.01
86,7/22/2004 0:00:00,17.06
86,11/15/2004 0:00:00,17.29
86,3/14/2005 0:00:00,17.38
86,4/19/2005 0:00:00,17.43
86,5/19/2005 0:00:00,17.28
As you can see, I have one dataframe with all the 87s in the first column and another with the 86s.
This is how I read the dataframe:
dfr = pd.read_csv(fname,sep=',',index_col=False,header=None)
I think that groupby is not the right option, if I have understood the command correctly.
I was thinking about query, as in:
aa = dfr.query(dfr.iloc[:,0]==86)
However, I have this error:
expr must be a string to be evaluated, <class 'pandas.core.series.Series'> given
You can simply slice your dataframe:
df_86 = df.loc[df['ColName'] == 86,:]
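Note that since the file was read with header=None, the columns are labeled with integers rather than names, so the equivalent slices for your dfr would be (a sketch):

df_86 = dfr.loc[dfr[0] == 86]
df_87 = dfr.loc[dfr[0] == 87]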
Another way is to do it dynamically, without having to specify the groups beforehand.
import numpy as np
import pandas as pd

df = pd.DataFrame({'ID': np.repeat([1, 2, 3], 4), 'col2': np.repeat([10, 11, 12], 4)})
Get the unique groupings:
groups = df['ID'].unique()
Create an empty dict to store the new data frames:
new_dfs = {}
Loop through and create new data frames from the slice:
for group in groups:
    name = "ID" + str(group)
    new_dfs[name] = df[df['ID'] == group]
new_dfs['ID1']
Which gives:
ID col2
0 1 10
1 1 10
2 1 10
3 1 10
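For what it's worth, iterating over a groupby object yields (key, sub-frame) pairs, so the same dict can be built in a single comprehension (a sketch using the naming above):

new_dfs = {"ID" + str(key): group for key, group in df.groupby('ID')}
new_dfs['ID1']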
I have a dataframe:
Text
Background
Clinical
Method
Direct
Background
Direct
Now I want to group them in a new column according to their first words, so that Background belongs to group 1, Clinical belongs to group 2, and so on.
The expected output:
a dataframe:
Text Group
Background 1
Clinical 2
Method 3
Direct 4
Background 1
Direct 4
Try this:
import pandas as pd
text = ['Background', 'Clinical', 'Method', 'Direct', 'Background', 'Direct']
df = pd.DataFrame(text, columns=['Text'])
def create_idx_map():
    idx = 1
    values = {}
    for item in list(df['Text']):
        if item not in values:
            values[item] = idx
            idx += 1
    return values
values = create_idx_map()
df['Group'] = [values[x] for x in list(df['Text'])]
print(df)
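For reference, pandas can produce the same numbering in one line with factorize, which assigns each distinct value an integer in order of first appearance (0-based, hence the + 1 to start at 1):

df['Group'] = pd.factorize(df['Text'])[0] + 1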
Idea: Make a list of unique values of the column Text and for the column Group you can assign the index of the value in this unique list. Code example:
df = pd.DataFrame({"Text": ["Background", "Clinical", "Clinical", "Method", "Background"]})
# List of unique values of column `Text`
groups = list(df["Text"].unique())
# Assign each value in `Text` its index
# (you can write `groups.index(text) + 1` when the first value shall be 1)
df["Group"] = df["Text"].map(lambda text: groups.index(text))
# Output for df
print(df)
Result:
Text Group
0 Background 0
1 Clinical 1
2 Clinical 1
3 Method 2
4 Background 0
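A side note on this approach: groups.index(text) performs a linear scan for every row, so on larger frames a precomputed dict passed to map is the usual alternative (a sketch, same 0-based numbering):

group_of = {text: i for i, text in enumerate(df["Text"].unique())}
df["Group"] = df["Text"].map(group_of)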
A solution could be the following:
import pandas as pd
data = pd.DataFrame([["A B", 1], ["A C", 2], ["B A", 3], ["B C", 5]], columns=("name", "value"))
data.groupby(by=[x.split(" ")[0] for x in data.loc[:,"name"]])
You can select the first few words using x.split(" ")[:NUMBER_OF_WORDS]. You then apply whatever aggregation you want to the resulting object.
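For instance, summing the value column within each first-word group of the data above would look like this (a sketch; any other aggregation works the same way):

data.groupby(by=[x.split(" ")[0] for x in data.loc[:, "name"]])["value"].sum()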
I have a df as follows:
QUESTIONCODE SUBJETCS
1 English
1 French
2 English
3 Japanese
4 English
4 Japanese
And I would like to create a pivot table where both the index and the columns would be the unique SUBJECTS values from df, filled with the number of QUESTIONCODEs belonging to the combination of SUBJECTs represented by each index and column. The result would then be:
          English  French  Japanese
English         3       1         1
French          1       1         0
Japanese        1       0         2
I have already tried several approaches using pandas functions such as groupby, pivot_table and crosstab, but I still could not get the result shown above.
Could anyone please help me with that?
I was able to find a solution, but it is by no means the best one. I believe it will allow you to get started. I provided some comments with the code. Let me know if you have any questions. Cheers.
import pandas as pd
import itertools
import collections
# this is your data
df = pd.DataFrame({"QUESTIONCODE": [1,1,2,3,4,4],
"SUBJETCS": ["English", "French", "English", "Japanese", "English", "Japanese"]})
df["SUBJETCS_"] = df["SUBJETCS"] # I am duplicating the subject column here
# pivoting to get the counts of each subject
dfPivot = df.pivot_table(index="SUBJETCS", columns="SUBJETCS_", values="QUESTIONCODE", aggfunc=["count"], fill_value=0).reset_index()
dfPivot.columns = ["SUBJETCS"] + sorted(df["SUBJETCS"].unique())
x = df.groupby("QUESTIONCODE")["SUBJETCS"].apply(",".join).reset_index() # for each QUESTIONCODE taking its subjects as a DataFrame
dictCombos = dict(zip(x["QUESTIONCODE"].tolist(), [s.split(",") for s in x["SUBJETCS"].tolist()])) # this will map QUESTIONCODE to its subject as a dictionary
list_all_pairs = [] # this will have all possible pair of subjects
for k, v in dictCombos.items():
    # v = list(set(v.split(",")))
    prm = list(itertools.permutations(v, 2))
    list_all_pairs.extend(prm)
dictMap = {c: i for i, c in enumerate(dfPivot.columns[1:])} # just maps each subject to an index
dictCounts = dict(collections.Counter(list_all_pairs)) # dictionary of all pairs to its counts
dictCoords = {} # indexing each subjects i.e. English 0, French 1, ..., this will allow to load as matrix
for pairs, counts in dictCounts.items():
    coords = (dictMap[pairs[0]], dictMap[pairs[1]])
    dictCoords[coords] = counts
x = dfPivot.iloc[:, 1:].values # saving the content of the pivot into an 2 dimensional array
for coords, counts in dictCoords.items():
    x[coords[0], coords[1]] = counts
dfCounts = pd.DataFrame(x, columns=dfPivot.columns[1:]) # loading the content of array into a DataFrame
df = pd.concat([dfPivot.iloc[:, 0], dfCounts], axis=1) # and finally putting it all together
That's the code I mentioned in the comment on griggy's answer!
import pandas as pd
import itertools
import collections
# Creating the dataframe
df = pd.DataFrame({"QUESTIONCODE": [1,1,2,3,4,4],
"SUBJETCS": ["English", "French", "English", "Japanese", "English", "Japanese"]})
# Pivoting to get the counts of each subject. Pandas needs distinct names for
# the row and column groupers, so a renamed copy of the column is passed as
# the columns argument.
dfPivot = pd.pivot_table(df, values='QUESTIONCODE', index='SUBJETCS',
                         columns=df['SUBJETCS'].rename('SUBJETCS_'),
                         aggfunc='count', fill_value=0)
# Creating a dataframe with each QUESTIONCODE and its SUBJECTs
x = df.groupby("QUESTIONCODE")["SUBJETCS"].apply(",".join).reset_index()
# Mapping QUESTIONCODE to its SUBJECTs as a dictionary
dictCombos = dict(zip(x["QUESTIONCODE"].tolist(), [s.split(",") for s in x["SUBJETCS"].tolist()]))
# Creating a list with all possible pair of SUBJECTs
list_all_pairs = []
for k, v in dictCombos.items():
    prm = list(itertools.permutations(v, 2))
    list_all_pairs.extend(prm)
# Creating a dictionary of all pairs of subjects and their counts
dictCounts = dict(collections.Counter(list_all_pairs))
# Filling the dfPivot dataframe with all pairs of subjects and their counts
for pairs, counts in dictCounts.items():
    dfPivot.loc[pairs] = counts
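For reference, the same co-occurrence matrix can be built without the explicit pair loop: merging the frame with itself on QUESTIONCODE enumerates every ordered pair of subjects within a question (including each subject paired with itself), and crosstab then counts the pairs. A sketch:

pairs = df.merge(df, on="QUESTIONCODE")  # suffixes _x/_y are added automatically
dfCounts = pd.crosstab(pairs["SUBJETCS_x"], pairs["SUBJETCS_y"])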
I have two problems with filling out a very large dataframe. A section of it is shown in the picture. I want the 1000 in E and F to be pulled down to 26 and no further. In the same way, I want the 2000 to be pulled up to -1 and down to the next 26. I thought I could do this with bfill and ffill, but unfortunately I don't know how... (picture1)
Another problem is that there are blocks in which the rows from -1 to 26 do not contain any values in E and F. How can I delete them or fill them with 0 so that no bfill or ffill makes wrong entries there?
(picture2)
import pandas as pd
import numpy as np
data = '/Users/Hanna/Desktop/Coding/Code.csv'
df_1 = pd.read_csv(data,usecols=["A",
"B",
"C",
"D",
"E",
"F",
],nrows=75)
base_list =[-1,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26]
df_c = pd.MultiIndex.from_product([
[4000074],
["SP000796746","SP001811642"],
[201824, 201828, 201832, 201835, 201837, 201839, 201845, 201850, 201910, 201918, 201922, 201926, 201909, 201916, 201918, 201920],
base_list],
names=["A", "B", "C", "D"]).to_frame(index=False)
df_3 = pd.merge(df_c, df_1, how='outer')
To make it easier to understand, I have shortened the example a bit. Picture 3 shows how it looks when it is filled, and picture 4 shows it correctly filled.
Assuming you have to find and fill values for a particular segment:
data = pd.read_csv('/Users/Hanna/Desktop/Coding/Code.csv')
# .loc slicing is inclusive on both ends, so each 27-row block is i..i+26.
for i in range(0, data.shape[0], 27):
    if i + 27 < data.shape[0]:
        data.loc[i:i+26, 'E'] = data['E'].iloc[i:i+27].max()
    else:
        data.loc[i:, 'E'] = data['E'].iloc[i:].max()
You can replace the max with whatever you want.
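If the data really does come in fixed 27-row blocks as assumed above, the loop can also be written as a grouped transform over the block number (a sketch):

data['E'] = data.groupby(data.index // 27)['E'].transform('max')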
You could find the indexes where you have -1 and then slice/loop over the columns to fill.
Just to create the sample data:
import pandas as pd
# Build the frame with column A populated and B/E empty (assigning a long
# column to an empty-index frame would raise a length-mismatch error).
df = pd.DataFrame({'A': list(range(-1, 26)) * 10}, columns=list('ABE'))
Add random values at each section:
import random

for i in df.index:
    if i % 27 == 0:
        df.loc[i, 'B'] = random.random()
    else:
        df.loc[i, 'B'] = 0
Find the indexes to slice over:
indx = df[df['A'] == -1].index.values
Fill out the data in column "E":
for i, j in zip(indx[:-1], indx[1:]):
    df.loc[i:j-1, 'E'] = df.loc[i:j-1, 'B'].max()
    if j == indx[-1]:
        df.loc[j:, 'E'] = df.loc[j:, 'B'].max()
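The same fill also works without tracking indexes at all: a cumulative count of the -1 markers in column A yields a group id per section, and a grouped transform broadcasts each section's max (a sketch against the sample df above):

df['E'] = df.groupby((df['A'] == -1).cumsum())['B'].transform('max')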