Efficiently converting a large pandas dataframe to a list - python

I would like to convert the first 50 items of a large pandas dataframe into a list, such that for each index the list holds the corresponding value, and holds 0 wherever the dataframe has no value at that index.
For example, a pandas dataframe which looks like this:
ID Count
0 20
1 50
2 60
4 90
5 20
.
49 65
.
9999999 60054
would be converted to the following list, with only the first 50 elements of the dataframe being relevant:
[20, 50, 60, 0, 90, 20, ..., 65]
Note that at index=3, the value in the list is 0, because the ID was not found in the pandas dataframe.

If I understand correctly:
mylist = (df.iloc[:50].set_index('ID')
            .reindex(range(50), fill_value=0)['Count']
            .tolist())
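
A minimal, self-contained sketch of that approach, using the sample values from the question (only the first few IDs shown):
import pandas as pd

# sample data from the question: ID 3 is missing
df = pd.DataFrame({'ID': [0, 1, 2, 4, 5], 'Count': [20, 50, 60, 90, 20]})

mylist = (df.iloc[:50].set_index('ID')
            .reindex(range(50), fill_value=0)['Count']
            .tolist())
print(mylist[:6])  # [20, 50, 60, 0, 90, 20] -- 0 filled in at the missing ID 3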

IIUC:
d = df.query('ID < 5')           # keep only the IDs we care about
m = dict(zip(*map(d.get, d)))    # builds the {ID: Count} mapping
[m.get(i, 0) for i in range(5)]  # 0 for missing IDs
[20, 50, 60, 0, 90]
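
The dict(zip(*map(d.get, d))) line works because iterating over a dataframe yields its column names, so map(d.get, d) returns the ID and Count columns, which zip then pairs up into an {ID: Count} mapping. For the full 50 elements the same pattern scales directly (a sketch, with the zip written out explicitly):
d = df.query('ID < 50')
m = dict(zip(d['ID'], d['Count']))          # explicit {ID: Count} mapping
mylist = [m.get(i, 0) for i in range(50)]   # 0 wherever an ID is absent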

Related

How to fill a column on a dataframe based on given conditions using a for-loop

I have a pandas dataframe where the column ARRIVAL_DELAY is filled with different values (float64). I want to insert a new column COMPENSATION which will be filled depending on the values from ARRIVAL_DELAY.
I want it to look like this:
ARRIVAL_DELAY COMPENSATION
10 0
25 0
43 50
61 250
I came up with the code below, but it seems that it is not working at all (it takes literally hours in Jupyter notebooks and never completes anything). Also, no errors or warnings are displayed:
fdf.insert(13, 'COMPENSATION', 0)
compensacion = [0, 50, 100, 250, 500, 1000]
for row in fdf['ARRIVAL_DELAY']:
    if row > 0 and row <= 15: fdf['COMPENSATION'].add(compensacion[0])
    elif row > 15 and row <= 30: fdf['COMPENSATION'].add(compensacion[1])
    elif row > 30 and row <= 60: fdf['COMPENSATION'].add(compensacion[2])
    elif row > 60 and row <= 120: fdf['COMPENSATION'].add(compensacion[3])
    elif row > 120 and row <= 180: fdf['COMPENSATION'].add(compensacion[4])
    else: fdf['COMPENSATION'].add(compensacion[5])
fdf.head(10)
I do not understand what's wrong, any ideas?
Finally, I am kind of new to Python so if anyone has an improvement idea, it will be more than welcomed 😃
Thank you!
Note that Series.add returns a new Series rather than modifying the column in place, so your loop computes values and throws them away; on top of that, iterating row by row is very slow. This can be accomplished with np.select, which is vectorized and also makes your code a little neater and more readable:
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'ARRIVAL_DELAY': [10, 25, 30, 61]
})
condition_list = [
    df['ARRIVAL_DELAY'].between(1, 15),
    df['ARRIVAL_DELAY'].between(16, 30),
    df['ARRIVAL_DELAY'].between(31, 60),
    df['ARRIVAL_DELAY'].between(61, 120),
    df['ARRIVAL_DELAY'].between(121, 180)
]
choice_list = [0, 50, 100, 250, 500]
df['COMPENSATION'] = np.select(condition_list, choice_list, 1000)
df
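
np.select evaluates the conditions in order and takes the first match per row; the third argument (1000 here) is the default used wherever no condition matches, which covers delays above 180.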
Try to avoid iterating over the rows of a pandas dataframe; do it only as a last resort. Pandas and numpy provide many vectorized functions that are highly optimized, and iterating over large dataframes can be excruciatingly slow.
numpy.select is an excellent option, as shown by @ArchAngelPwn. An alternative is pandas' cut(), which bins continuous values into discrete intervals and is also very efficient:
df = pd.DataFrame([10, 25, 43, 61], columns=['ARRIVAL_DELAY'])
df['COMPENSATION'] = pd.cut(df.ARRIVAL_DELAY,
                            [0, 15, 30, 60, 120, 180, np.inf],
                            labels=[0, 50, 100, 250, 500, 1000])
print(df)
ARRIVAL_DELAY COMPENSATION
0 10 0
1 25 50
2 43 100
3 61 250
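
One caveat worth noting: pd.cut with labels returns a Categorical column. If you need plain integers downstream, you can convert afterwards:
df['COMPENSATION'] = df['COMPENSATION'].astype(int)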

Is there a more efficient or concise way to divide a df according to a list of indexes?

I'm trying to slice/divide the following dataframe
df = pd.DataFrame(
    {'time': [4, 10, 15, 6, 0, 20, 40, 11, 9, 12, 11, 25],
     'value': [0, 0, 0, 50, 100, 0, 0, 70, 100, 0, 100, 20]}
)
according to a list of indexes to split on:
[5, 7, 9]
The list doesn't include the first and last indexes of the dataframe; those boundaries are implied. I'm trying to get the following four dataframes as a result (defined by the three given indexes plus the beginning and end of the original df), each assigned to its own variable:
time value
0 4 0
1 10 0
2 15 0
3 6 50
4 0 100
time value
5 20 0
6 40 0
time value
7 11 70
8 9 100
time value
9 12 0
10 11 100
11 25 20
My current solution gives me a list of dataframes that I could then assign to variables manually by list index, but the code is a bit complex, and I'm wondering if there's a simpler/more efficient way to do this.
indexes = [5, 7, 9]
indexes.insert(0, 0)
indexes.append(df.index[-1] + 1)

i = 0
df_list = []
while i + 1 < len(indexes):
    df_list.append(df.iloc[indexes[i]:indexes[i+1]])
    i += 1
This is all coming off of my attempt to answer this question. I'm sure there's a better approach to that answer, but I did feel like there should be a simpler way to do this kind of slicing than what I thought of.
You can use np.split:
df_list = np.split(df, indexes)
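
For completeness, a short sketch of how that plays out on the dataframe above, unpacking the four pieces into their own variables (names are illustrative; on newer pandas versions np.split on a DataFrame may emit a deprecation warning, in which case a list comprehension over iloc slices does the same job):
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {'time': [4, 10, 15, 6, 0, 20, 40, 11, 9, 12, 11, 25],
     'value': [0, 0, 0, 50, 100, 0, 0, 70, 100, 0, 100, 20]}
)
indexes = [5, 7, 9]

# np.split slices row-wise at the given positions; each piece keeps its original index
df1, df2, df3, df4 = np.split(df, indexes)
print(df2)  # rows 5 and 6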

How can I make this kind of aggregation in pandas?

I have a dataframe that has categorical columns and numerical columns, and I want some aggregation of the values in the numerical columns (max, min, sum...) depending on the value of the categorical ones (so I have to create new columns for each value that each categorical column can take).
To make it more understandable, it's better to give a toy example.
Say that I have this dataframe:
import pandas as pd

df = pd.DataFrame({
    'ref': [1, 1, 1, 2, 2, 3],
    'value_type': ['A', 'B', 'A', 'C', 'C', 'A'],
    'amount': [100, 50, 20, 300, 150, 70]
}).set_index(['ref'])
value_type amount
ref
1 A 100
1 B 50
1 A 20
2 C 300
2 C 150
3 A 70
And I want to aggregate the amounts over the values of value_type, also grouping by each reference. The result in this case (supposing that only the sum is needed) would be this:
df_result = pd.DataFrame({
    'ref': [1, 2, 3],
    'sum_amount_A': [120, 0, 70],
    'sum_amount_B': [50, 0, 0],
    'sum_amount_C': [0, 450, 0]
}).set_index('ref')
sum_amount_A sum_amount_B sum_amount_C
ref
1 120 50 0
2 0 0 450
3 70 0 0
I have tried something that works, but it's extremely inefficient: it takes several minutes to process approx. 30,000 rows.
What I have done is this (I have a dataframe with a single row for each ref index, called df_final):
df_grouped = df.groupby(['ref'])
for ref in df_grouped.groups:
    df_aux = df.loc[[ref]]
    column = 'value_type'  # I have more columns, but for illustration one is enough
    for value in df_aux[column].unique():
        df_aux_column_value = df_aux.loc[df_aux[column] == value]
        df_final.at[ref, 'sum_' + column + '_' + str(value)] = np.sum(df_aux_column_value['amount'])
I'm sure there must be better ways of doing this aggregation... Thanks in advance!!
EDIT:
The answer given is correct when there is only one column to group by. In my real dataframe I have several columns on which I want to calculate some agg functions, but over the values of each column separately. That is, I don't want an aggregated value for each combination of values across columns, but only for each column by itself.
Let's make an example.
import pandas as pd

df = pd.DataFrame({
    'ref': [1, 1, 1, 2, 2, 3],
    'sexo': ['Hombre', 'Hombre', 'Hombre', 'Mujer', 'Mujer', 'Hombre'],
    'lugar_trabajo': ['Campo', 'Ciudad', 'Campo', 'Ciudad', 'Ciudad', 'Campo'],
    'dificultad': ['Alta', 'Media', 'Alta', 'Media', 'Baja', 'Alta'],
    'amount': [100, 50, 20, 300, 150, 70]
}).set_index(['ref'])
This dataframe looks like this:
sexo lugar_trabajo dificultad amount
ref
1 Hombre Campo Alta 100
1 Hombre Ciudad Media 50
1 Hombre Campo Alta 20
2 Mujer Ciudad Media 300
2 Mujer Ciudad Baja 150
3 Hombre Campo Alta 70
If I group by several columns, or make a pivot table (which as far as I know is more or less equivalent), like this:
df.pivot_table(index='ref', columns=['sexo', 'lugar_trabajo', 'dificultad'], values='amount',
               aggfunc=[np.sum, np.min, np.max, len], dropna=False)
I get a dataframe with 48 columns (because I have 3 * 2 * 2 combinations of values, and 4 agg functions).
A way of achieving the result that I want is this:
df_agregado = pd.DataFrame(df.index).set_index('ref')
for col in ['sexo', 'lugar_trabajo', 'dificultad']:
    df_agregado = pd.concat([df_agregado,
                             df.pivot_table(index='ref', columns=[col], values='amount',
                                            aggfunc=[np.sum, np.min, np.max, len])], axis=1)
I do each group-by alone and concat all of them. This way I get 28 columns (2 * 4 + 3 * 4 + 2 * 4). It works and it's fast, but it's not very elegant. Is there another way of getting this result?
The more efficient way is to use pandas' built-in functions instead of for loops. There are two main steps you should take.
First, you need to group not only by the index, but by both the index and the column:
res = df.groupby(['ref','value_type']).sum()
print(res)
The output is like this at this step:
amount
ref value_type
1 A 120
B 50
2 C 450
3 A 70
Second, you need to unstack the MultiIndex, as follows:
df2 = res.unstack(level='value_type', fill_value=0)
The output will be your desired output:
amount
value_type A B C
ref
1 120 50 0
2 0 0 450
3 70 0 0
As an optional step you can use droplevel to flatten it:
df2.columns = df2.columns.droplevel()
value_type A B C
ref
1 120 50 0
2 0 0 450
3 70 0 0
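
If you also want column names in the sum_amount_A style from the question, one small follow-up (using the standard add_prefix) is:
df2 = df2.add_prefix('sum_amount_')
#      sum_amount_A  sum_amount_B  sum_amount_C
# ref
# 1             120            50             0
# 2               0             0           450
# 3              70             0             0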

pandas: replace all values of a column with values that increment by n, starting at 0

Say I have a dataframe like so that I have read in from a file (note: *.ene is a txt file)
df = pd.read_fwf('filename.ene')
TS DENSITY STATS
1
2
3
1
2
3
I would like to only change the TS column. I wish to replace all the column values of 'TS' with the values from range(0,751,125). The desired output should look like so:
TS DENSITY STATS
0
125
250
500
625
750
I'm a bit lost and would like some insight into how to do such a thing in a general way.
I used a for loop to store the values into a list:
K = (6 * 125) + 1
m = []
for i in range(0, K, 125):
    m.append(i)
I thought to use .replace like so:
df['TS']=df['TS'].replace(old_value, m, inplace=True)
but was not sure what to put in place of old_value to select all the values of the 'TS' column or if this would even work as a method.
It's pretty straightforward: if you're replacing all the data in the column, you just need to do
df['TS'] = m
Example:
import pandas as pd
data = [[10, 20, 30], [40, 50, 60], [70, 80, 90]]
df = pd.DataFrame(data, index=[0, 1, 2], columns=['a', 'b', 'c'])
print(df)
# a b c
# 0 10 20 30
# 1 40 50 60
# 2 70 80 90
df['a'] = [1,2,3]
print(df)
# a b c
# 0 1 20 30
# 1 2 50 60
# 2 3 80 90
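
A small caution and an alternative: the loop above produces seven values (0 through 750), so assigning it to a six-row column would raise a length-mismatch ValueError; the list length must equal len(df). You can skip the intermediate list entirely with numpy (a sketch, assuming the six-row frame from the question):
import numpy as np

df['TS'] = np.arange(0, len(df) * 125, 125)   # one value per row: 0, 125, 250, ...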

Pandas DataFrame - insert copy of row with some changes

Say I have
import pandas as pd
x = pd.DataFrame.from_dict({'A':[1,2,3,4,5,6], 'B':[10, 20, 30, 44, 48, 81]})
And I want to insert a copy of the row at index 5, but with 2 added to its 'A' value and 7 added to its 'B' value. How can I do this?
Obviously in the real example the dataframe has many more columns, that's why it makes sense for me to copy a row rather than manually populate the value for each column in it.
First create a copy of the row you need from the original dataframe, then adjust the values in it, then concat it back:
x1 = x.loc[[5], :].copy()   # double brackets keep it a DataFrame; copy avoids a SettingWithCopyWarning
x1.A += 2
x1.B += 7
x_new = pd.concat([x, x1]).sort_index()
x_new
Out[291]:
A B
0 1 10
1 2 20
2 3 30
3 4 44
4 5 48
5 6 81
5 8 88
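
Note that the copied row keeps its original label, so x_new has index 5 twice. If a unique index matters downstream, a common follow-up is:
x_new = x_new.reset_index(drop=True)   # renumber 0..6, removing the duplicate label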
