I have a fairly large dataframe whose columns I am trying to combine in a very specific manner. The original dataframe has 2150 columns and the final dataframe should have around 500, produced by taking the average of a spread of columns to form each new column. The spread changes, which is why I use a list holding the start of each column group.
My actual code gets the desired results, but it raises the warning:
"PerformanceWarning: DataFrame is highly fragmented. This is usually the result of calling frame.insert many times, which has poor performance. Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use newframe = frame.copy()
df1[str(val)] = df[combine].mean(axis=1)"
I cannot think of a smart way to use concat for one single combine at the end whilst still taking the mean of each group. I am also new to writing code, so any corrections to my style would be appreciated, especially where I have to break out of the loop.
Here is my actual code.
import pandas as pd

df = pd.read_csv("some file location")

new_cols = list(range(350, 702, 3)) + list(range(707, 1398, 6)) + \
    list(range(1407, 2098, 10)) + list(range(2112, 2488, 15)) + [2501]
cols = list(map(int, list(df.columns)[1:]))

df1 = df.copy()
for i, val in enumerate(new_cols):
    if val == 2501:
        break
    combine = list(map(str, range(new_cols[i], new_cols[i+1])))
    print(combine)
    df1 = df1.drop(combine, axis=1, inplace=False)
    df1[str(val)] = df[combine].mean(axis=1)

df1.to_csv("data_reduced_precision.csv", index=False)
print("Finished")
Here is a minimal example which shows what I am trying to achieve. It doesn't produce the PerformanceWarning, as it has only a few columns, but I hope it illustrates my method.
df1 = pd.DataFrame({'1': [1, 2, 3, 4],
                    '2': [5, 6, 7, 8],
                    '3': [9, 10, 11, 12],
                    '4': [13, 14, 15, 16],
                    '5': [17, 18, 19, 20],
                    '6': [21, 22, 23, 24],
                    '7': [25, 26, 27, 28]})
df2 = df1.copy()

# df2 should have columns 1,2,5 which are the mean of df1 columns [1],[2,3,4],[5,6,7]
new_cols = [1, 2, 5, 8]
for i, val in enumerate(new_cols):
    if val == 8:
        break
    # All the column names are integers as str
    combine = list(map(str, range(new_cols[i], new_cols[i+1])))
    df2 = df2.drop(combine, axis=1, inplace=False)
    df2[str(val)] = df1[combine].mean(axis=1)

print(df2)
1 2 5
0 1.0 9.0 21.0
1 2.0 10.0 22.0
2 3.0 11.0 23.0
3 4.0 12.0 24.0
I would move your dataframe operations out of your for-loop.
import pandas

df1 = pandas.DataFrame({
    '1': [1, 2, 3, 4],
    '2': [5, 6, 7, 8],
    '3': [9, 10, 11, 12],
    '4': [13, 14, 15, 16],
    '5': [17, 18, 19, 20],
    '6': [21, 22, 23, 24],
    '7': [25, 26, 27, 28],
})

# df2 should have columns 1,2,5 which are the mean of df1 columns [1],[2,3,4],[5,6,7]
new_cols = [1, 2, 5, 8]

combos = []
for i, val in enumerate(new_cols):
    if val != 8:
        # All the column names are integers as str
        combos.append(list(map(str, range(new_cols[i], new_cols[i+1]))))

df2 = df1.assign(**{
    str(maincol): df1.loc[:, combo].mean(axis="columns")
    for maincol, combo in zip(new_cols, combos)
}).loc[:, map(str, new_cols[:-1])]
Unless I'm mistaken, this passes around references to the original df1 instead of making a bunch of copies (as df2 = df2.drop(...) does).
Printing out df2, I get:
1 2 5
0 1.0 9.0 21.0
1 2.0 10.0 22.0
2 3.0 11.0 23.0
3 4.0 12.0 24.0
If I scale this up to a 500,000 x 20 dataframe, it completes seemingly instantly without warning on my machine:
import numpy

dfbig = pandas.DataFrame(
    data=numpy.random.normal(size=(500_000, 20)),
    columns=list(map(str, range(1, 21)))
)

new_cols = [1, 2, 5, 8, 12, 13, 16, 17, 19]

combos = []
for i, val in enumerate(new_cols[:-1]):
    combos.append(list(map(str, range(new_cols[i], new_cols[i+1]))))

dfbig2 = dfbig.assign(**{
    str(maincol): dfbig.loc[:, combo].mean(axis="columns")
    for maincol, combo in zip(new_cols, combos)
}).loc[:, map(str, new_cols[:-1])]
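If you want to follow the warning's own advice literally, here is a minimal sketch (reusing the dfbig, new_cols and combos defined above; dfbig3 is just an illustrative name) that builds each group mean as a named Series and joins them all with a single pd.concat(axis=1), so no column is ever inserted one at a time:

# build one named Series of means per group, then join them in one concat
parts = [
    dfbig.loc[:, combo].mean(axis="columns").rename(str(maincol))
    for maincol, combo in zip(new_cols, combos)
]
dfbig3 = pandas.concat(parts, axis="columns")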
I have a dataframe consisting of float64 values. I have to divide each value by 100, except for the values in the row with index 388. For that I wrote the following code.
Preprocessing:
df = pd.read_csv('state_cpi.csv')
d = {'January':1, 'February':2, 'March':3, 'April':4, 'May':5, 'June':6, 'July':7, 'August':8, 'September':9, 'October':10, 'November':11, 'December':12}
df['Month']=df['Name'].map(d)
r = {'Rural':1, 'Urban':2, 'Rural+Urban':3}
df['Region_code']=df['Sector'].map(r)
df['Himachal Pradesh'] = df['Himachal Pradesh'].str.replace('--','NaN')
df['Himachal Pradesh'] = df['Himachal Pradesh'].astype('float64')
Extracting the data of interest:
data = df.iloc[:,3:-2]
Applying the division to the data dataframe:
data[:,:388] = (data[:,:388] / 100).round(2)
data[:,389:] = (data[:,389:] / 100).round(2)
It returned me a dataframe where the data of row no. 388 was also divided by 100.
As an example, take the dataframe created below. All indices except 10 are copied into the aaa list. Those index labels are then used to select rows, and 1 is added to each element; the row with index 10 remains unchanged.
df = pd.DataFrame({'a': [1, 23, 4, 5, 7, 7, 8, 10, 9],
                   'b': [1, 2, 3, 4, 5, 6, 7, 8, 9]},
                  index=[1, 2, 5, 7, 8, 9, 10, 11, 12])

aaa = df[df.index != 10].index
df.loc[aaa, :] = df.loc[aaa, :] + 1
In your case, the code will be as follows:
aaa = data[data.index != 388].index
data.loc[aaa, :] = (data.loc[aaa, :] / 100).round(2)
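Equivalently, a small sketch with a boolean mask instead of an index list (same data dataframe, same rounding) avoids building aaa at all:

mask = data.index != 388
data.loc[mask, :] = (data.loc[mask, :] / 100).round(2)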
I am trying to create summaries of unique buyers for each of the different products in my Sales table. My target outcome is as follows:
  CustSeg  UNIQUE_PROD1_CUST
0    High                  7
1     Low                  8
2     Mid                  4
This summary is created and assigned to a variable as below:
# Count of DISTINCT PROD1 CUSTOMERS
PROD1_CUST = (
    Sales_Df.loc[(Sales_Df.Prod1_Qty > 0)]
    .groupby("CustSeg")["CustID"]
    .count()
    .reset_index(name="UNIQUE_PROD1_CUST")
)
PROD1_CUST
The Sales_Df dataframe can be replicated thus:
Sales_Qty = {
    "CustID": ['C01', 'C02', 'C03', 'C04', 'C05', 'C06', 'C07', 'C08', 'C09', 'C10', 'C11', 'C12', 'C13', 'C14', 'C15', 'C16', 'C17', 'C18', 'C19', 'C20'],
    "CustSeg": ['High', 'High', 'Mid', 'High', 'Low', 'Low', 'Low', 'Low', 'Low', 'Mid', 'Low', 'Low', 'Mid', 'Low', 'High', 'High', 'High', 'High', 'Mid', 'Low'],
    "Prod1_Qty": [8, 7, 12, 15, 7, 15, 7, 8, 3, 15, 0, 3, 4, 4, 7, 11, 12, 12, 6, 1],
    "Prod2_Qty": [2, 5, 0, 1, 14, 15, 3, 1, 11, 0, 5, 11, 12, 8, 6, 15, 7, 4, 3, 10],
    "Prod3_Qty": [13, 4, 0, 11, 3, 5, 11, 11, 10, 14, 2, 4, 3, 14, 14, 10, 5, 0, 0, 9],
    "Prod4_Qty": [11, 15, 2, 0, 6, 2, 12, 14, 11, 15, 5, 14, 13, 0, 10, 2, 13, 11, 12, 15],
    "Prod5_Qty": [9, 15, 5, 4, 9, 0, 13, 9, 8, 11, 10, 12, 8, 3, 14, 11, 9, 15, 8, 14]
}
Sales_Df = pd.DataFrame(Sales_Qty)
Sales_Df
Now, in real life, the dataframe's shape is far larger (at least (5000000, 130)), which makes manually repeating the summary for each product tedious, so I am trying to automate the creation of the variables and the summaries. I am approaching the task with the following steps.
# Extract the proposed variable names from the dataframe column names.
all_cols = Sales_Df.columns.values.tolist()
# Remove non-product quantity columns from the list
not_prod_cols = ["CustSeg", "CustID"]
prod_cols = [x for x in all_cols if x not in not_prod_cols]
I know the next steps should be:
1. Creating the variable names from the list prod_cols and storing them in a list - let's name the list prod_dfs:
prod_dfs = []
2. Creating the dynamic formula that builds the dataframes and appends their variable names to prod_dfs, using the "logic" below.
for x in prod_cols:
    x[:-4] + "_CUST" = (
        Sales_Df.loc[(Sales_Df.x > 0)]
        .groupby("CustSeg")["CustID"]
        .count()
        .reset_index(name="UNIQUE" + x[:-4] + "_CUST")
    )
    prod_dfs.append(x)
This is where I am stuck. Kindly assist.
Thank you for sharing a reproducible example; it seems like you have made good progress. If I understand correctly, you want to compute the number of unique customers per segment who have purchased a given item.
To follow your approach, you could iterate through the product columns, compute the counts, and assign this to a results dataframe:
prod_cols = [col for col in Sales_Df.columns if col.startswith('Prod')]

result = None
for prod in prod_cols:
    counts = (
        Sales_Df
        .loc[Sales_Df[prod] > 0]
        .groupby('CustSeg')
        [prod]
        .count()
    )
    if result is None:
        result = counts.to_frame()
    else:
        result[prod] = counts
         Prod1_Qty  Prod2_Qty  Prod3_Qty  Prod4_Qty  Prod5_Qty
CustSeg
High             7          7          6          6          7
Low              8          9          9          8          8
Mid              4          2          2          4          4
This helps along the column dimension, in the sense that you do not have to write the aggregation code by hand for every product column.
However, the resulting code is not very efficient because it does O(m) groupby operations, where m is the number of columns.
You can get your desired result with slightly different logic:
1. Form groups for each customer segment.
2. For each product, count the number of purchasers.
3. Combine the results.
This one-liner implements that logic:
Sales_Df.drop('CustID', axis=1).groupby('CustSeg').apply(lambda group: (group>0).sum(axis=0))
Note that we first drop CustID because in your example, after grouping by CustSeg, it is the only column that is not a product quantity.
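If you also want the UNIQUE_PROD1_CUST-style names from your target output, a possible follow-up sketch (swapping apply for a plain gt(0) plus groupby-sum, and assuming every product column name ends in '_Qty'):

result = (
    Sales_Df.drop('CustID', axis=1)
    .set_index('CustSeg')
    .gt(0)                        # True where the customer bought the product
    .groupby(level='CustSeg')
    .sum()                        # count of buyers per segment and product
)
# 'Prod1_Qty' -> 'UNIQUE_PROD1_CUST', and so on
result.columns = ['UNIQUE_' + c[:-4].upper() + '_CUST' for c in result.columns]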
As an aside: consider reviewing the pandas indexing basics. You may find it easier to use the syntax of df['A'] rather than df.A because it allows you to use other programming constructs more effectively.
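For instance, attribute access cannot take a column name held in a variable, which is exactly what trips up Sales_Df.x in your loop, while bracket access can:

col = 'Prod1_Qty'                      # column name held in a variable
buyers = Sales_Df[Sales_Df[col] > 0]   # works with bracket access
# Sales_Df.col would look for a column literally named 'col' and fail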
So I am kind of stuck here. My data is something like this:
df = pd.DataFrame({'X': [1, 2, 3, 4, 5, 4, 3, 2, 1],
                   'Y': [6, 7, 8, 9, 10, 9, 8, 7, 6],
                   'Z': [11, 12, 13, 14, 15, 14, 13, 12, 11]})
I'd like to write code that sets the values of rows 6 to 9 of column 'Z' to NaN.
The best I've come up with is:
df.replace({'Z': { 6: np.NaN, 7: np.NaN }})
but all this does is replace cells by value (and it would only do anything if applied to column 'Y', where the values 6 and 7 actually appear), rather than by row position.
I am confused about how to change the values of particular rows in a column when some of the values in that column are the same.
You can use the loc indexer for your dataframe. I've used rows 6 to 8 because df doesn't have a row 9:
df.loc[range(6, 9), 'Z'] = pd.NA
you could use:
df.Z[6:9] = np.NaN
I think you should use .iloc for this.
First of all, the index is zero-based, so there is no row 9.
To change the values of rows 6 to 8 in column 'Z' to pd.NA you could do something like this:
df.iloc[6:9, 2:] = pd.NA
I'm assuming pandas >= 1.0, which introduced pd.NA.
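For reference, a self-contained sketch of the positional approach, assuming the df from the question (its last three rows sit at positions 6-8); casting 'Z' to float first keeps the NaN assignment clean on newer pandas:

import numpy as np
import pandas as pd

df = pd.DataFrame({'X': [1, 2, 3, 4, 5, 4, 3, 2, 1],
                   'Y': [6, 7, 8, 9, 10, 9, 8, 7, 6],
                   'Z': [11, 12, 13, 14, 15, 14, 13, 12, 11]})

df['Z'] = df['Z'].astype('float64')             # an int column cannot hold NaN
df.iloc[6:9, df.columns.get_loc('Z')] = np.nan  # rows at positions 6-8
print(df)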
The dataframe looks like this:
0, 3710.968017578125, 2012-01-07T03:13:43.859Z
1, 3710.968017578125, 2012-01-07T03:13:48.890Z
2, 3712.472900390625, 2012-01-07T03:13:53.906Z
3, 3712.472900390625, 2012-01-07T03:13:58.921Z
4, 3713.110107421875, 2012-01-07T03:14:03.900Z
5, 3713.110107421875, 2012-01-07T03:14:03.937Z
6, 3713.89892578125, 2012-01-07T03:14:13.900Z
7, 3713.89892578125, 2012-01-07T03:14:13.968Z
8, 3713.89892578125, 2012-01-07T03:14:19.000Z
9, 3714.64990234375, 2012-01-07T03:14:24.000Z
10, 3714.64990234375, 2012-01-07T03:14:24.015Z
11, 3714.64990234375, 2012-01-07T03:14:29.000Z
12, 3714.64990234375, 2012-01-07T03:14:29.031Z
Some rows have timestamps that differ only by milliseconds; I want to drop those and keep only the rows whose timestamps differ at the second level. There are also rows with the same value whose timestamps differ by whole seconds, like rows 9 to 12, so I can't use a.loc[a.shift() != a].
The desired output would be:
0, 3710.968017578125, 2012-01-07T03:13:43.859Z
1, 3710.968017578125, 2012-01-07T03:13:48.890Z
2, 3712.472900390625, 2012-01-07T03:13:53.906Z
3, 3712.472900390625, 2012-01-07T03:13:58.921Z
4, 3713.110107421875, 2012-01-07T03:14:03.900Z
6, 3713.89892578125, 2012-01-07T03:14:13.900Z
8, 3713.89892578125, 2012-01-07T03:14:19.000Z
9, 3714.64990234375, 2012-01-07T03:14:24.000Z
11, 3714.64990234375, 2012-01-07T03:14:29.000Z
Try:
df.groupby(pd.to_datetime(df[2]).astype('datetime64[s]')).head(1)
I hope it's self-explanatory.
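The same idea can also be sketched with floor() and duplicated(), assuming the timestamps sit in column 2 as above: truncate each timestamp to whole seconds and keep only the first row of every one-second bucket.

# floor timestamps to whole seconds, then keep the first row per second
secs = pd.to_datetime(df[2]).dt.floor('s')
result = df.loc[~secs.duplicated(keep='first')]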
You can use the script below. I didn't have your dataframe's column names, so I invented the columns ['x', 'date_time']:
df = pd.DataFrame([
    (3710.968017578125, pd.to_datetime('2012-01-07T03:13:43.859Z')),
    (3710.968017578125, pd.to_datetime('2012-01-07T03:13:48.890Z')),
    (3712.472900390625, pd.to_datetime('2012-01-07T03:13:53.906Z')),
    (3712.472900390625, pd.to_datetime('2012-01-07T03:13:58.921Z')),
    (3713.110107421875, pd.to_datetime('2012-01-07T03:14:03.900Z')),
    (3713.110107421875, pd.to_datetime('2012-01-07T03:14:03.937Z')),
    (3713.89892578125, pd.to_datetime('2012-01-07T03:14:13.900Z')),
    (3713.89892578125, pd.to_datetime('2012-01-07T03:14:13.968Z')),
    (3713.89892578125, pd.to_datetime('2012-01-07T03:14:19.000Z')),
    (3714.64990234375, pd.to_datetime('2012-01-07T03:14:24.000Z')),
    (3714.64990234375, pd.to_datetime('2012-01-07T03:14:24.015Z')),
    (3714.64990234375, pd.to_datetime('2012-01-07T03:14:29.000Z')),
    (3714.64990234375, pd.to_datetime('2012-01-07T03:14:29.031Z'))],
    columns=['x', 'date_time'])
The steps are:
1. Create a column 'time_diff' holding the difference between the datetime of the current row and the previous row (within each group of 'x').
2. Keep only the rows where that difference is either null or more than 1 second.
3. Drop the temporary 'time_diff' column.
df['time_diff'] = df.groupby('x')['date_time'].diff()
df = df[(df['time_diff'].isnull()) | (df['time_diff'].map(lambda x: x.seconds > 1))]
df = df.drop(['time_diff'], axis=1)
df