Keep specific columns from a pandas dataframe - Python

I have a dataframe imported from a CSV using pandas. This dataframe has 160 variables and I would like to keep only numbers 5, 9, 10, 46 and 89.
I tried this:
dataf2 = dataf[[5] + [9] + [10] + [46] + [89]]
but I get this error:
KeyError: '[ 5 9 10 46 89] not in index'

If you want to refer to columns not by their names but by their positions in the dataset, you need to use .iloc:
dataf.iloc[:, [5, 9, 10, 46, 89]]
Row indices are specified before the comma, column indices are specified after the comma.
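As a minimal, self-contained sketch (a tiny 3-column frame stands in for the 160-column one):
import pandas as pd

df = pd.DataFrame({'a': [1, 2], 'b': [3, 4], 'c': [5, 6]})
print(df.iloc[:, [0, 2]])  # all rows; first and third columns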

If the labels of the columns you would like to keep are literally the integers 5, 9, 10, 46 and 89 (rather than positions), then you can index just these ones like so:
dataf2 = dataf[[5, 9, 10, 46, 89]]
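If the labels are not integers, an alternative sketch (not from the answers above) is to translate the positions into labels via df.columns and keep plain [] indexing:
dataf2 = dataf[dataf.columns[[5, 9, 10, 46, 89]]]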

Related

How to return every N alternate rows from a pandas dataframe?

Let's say I have a dataframe with 1000 rows. Is there an easy way of slicing the dataframe in such a way that the resulting dataframe consists of alternating blocks of N rows?
For example, I want rows 1-100, 200-300, 400-500, ... and so on, skipping 100 rows in between, and create a new dataframe out of this.
I can do this by storing each individual slice in a new dataframe first and then appending them at the end, but I was wondering if there is a much simpler way to do this.
You can use a boolean mask built from the row positions (a row is kept when its position modulo 200 is below 100):
import numpy as np
out = df[np.arange(len(df))%200<100]
For the demo, here is an example with rows 1-10, 21-30, etc.:
import pandas as pd

df = pd.DataFrame(index=range(100))
out = df[np.arange(len(df)) % 20 < 10]
out.index
output:
Int64Index([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9,   # rows 1-10
            20, 21, 22, 23, 24, 25, 26, 27, 28, 29,   # rows 21-30
            40, 41, 42, 43, 44, 45, 46, 47, 48, 49,   # rows 41-50
            60, 61, 62, 63, 64, 65, 66, 67, 68, 69,   # rows 61-70
            80, 81, 82, 83, 84, 85, 86, 87, 88, 89],  # rows 81-90
           dtype='int64')
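To see how the mask works on its own, here is a tiny illustration (period 4, keeping 2):
import numpy as np

np.arange(10) % 4 < 2
# array([ True,  True, False, False,  True,  True, False, False,  True,  True])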
You can also use a list comprehension and a simple math operation to select specific rows.
In case you don't know, % is the modulo operator in Python, which returns the remainder of a division between two numbers.
The int function truncates the decimal part of a number, so int(i/N) tells you which block of N rows position i falls into.
Let df be your dataframe and N be your interval (in your example N=100):
N = 100
df.loc[[i for i in range(df.shape[0]) if int(i/N) % 2 == 0]]
This will return rows with indexes 0-99, 200-299, 400-499, ...
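Note that .loc with integer positions only works here because the frame keeps its default RangeIndex; a purely position-based sketch of the same idea uses .iloc and floor division:
N = 100
df.iloc[[i for i in range(df.shape[0]) if (i // N) % 2 == 0]]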

Convert dict with lists in values to Pandas dataframe [duplicate]

This question already has answers here: Create pandas dataframe from nested dict with outer keys as df index and inner keys column headers
I have a dictionary with the following keys and values
my_dict = {'1': {'name': 'one',
                 'f_10': [1, 10, 20, 30],
                 'f_20': [1, 20, 40, 60]},
           '2': {'name': 'two',
                 'f_10': [2, 12, 22, 32],
                 'f_20': [2, 22, 42, 62]}}
How do I convert it to a Pandas DataFrame that will look like:
  name          f_10          f_20
1  one  [1,10,20,30]  [1,20,40,60]
2  two  [2,12,22,32]  [2,22,42,62]
Each list needs to stay in a single cell of its column, keyed by the outer dictionary key; if I try to concat, the list elements get converted to separate rows in the data frame.
Simply use orient='index' when importing your data using from_dict:
df = pd.DataFrame.from_dict(my_dict, orient = 'index')
This returns:
name f_10 f_20
1 one [1, 10, 20, 30] [1, 20, 40, 60]
2 two [2, 12, 22, 32] [2, 22, 42, 62]
I would parse that dictionary into a DataFrame and transpose it. For example,
pd.DataFrame(my_dict).T
Result
name f_10 f_20
1 one [1, 10, 20, 30] [1, 20, 40, 60]
2 two [2, 12, 22, 32] [2, 22, 42, 62]
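As a small follow-up sketch (not part of the answers above): if you also want the outer keys available as a regular column, name the index and reset it; 'key' here is an arbitrary column name:
df = pd.DataFrame.from_dict(my_dict, orient='index')
df = df.rename_axis('key').reset_index()  # moves the index into a 'key' column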

Python stopped iterating while trying to create lists of sequences and inserting them into Excel using pandas

I am trying to create a list of all 6-number lists from 1 to 49, by looping from 1 to 49 and creating all possible sets.
The issue is that the code stops at number 15, and in PyCharm nothing is being printed (the Excel file is being written but stops at record 38,759).
import itertools
import pandas as pd

stuff = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49]
all = []
for L in range(0, len(stuff) + 1):
    for subset in itertools.combinations(stuff, L):
        alist = list(subset)
        if len(subset) == 6:
            all.append(alist)
all_tuple = tuple(all)
df = pd.DataFrame(all_tuple, columns=['z1', 'z2', 'z3', 'z4', 'z5', 'z6'])
print(df)
df.to_excel('test.xlsx')
If I understand correctly, you are trying to find the possible combinations of 6 numbers sampled from the list [1, 2, 3, ..., 49] without replacement.
But your code calculates the combinations of all lengths and then only saves those of length 6.
To get a clue as to why your code does not terminate quickly, consider the number of combinations of 6 numbers:
>>> print(len(list(itertools.combinations(range(1, 50), 6))))
13983816
So, if there are 14 million possible combinations of 6 numbers, imagine how many combinations there are of 7, 8, 9, ... Summed over all lengths, there are 2^49 ≈ 5.6 × 10^14 subsets in total, which is why the loop never finishes.
Here is some code to calculate only the 14 million combinations of length 6:
combs = list(itertools.combinations(range(1, 50), 6))
Or, if you really want to build the dataframe:
# Warning, this takes about 25 seconds
combs = itertools.combinations(range(1, 50), 6)
df = pd.DataFrame(combs, columns=['z1','z2','z3','z4','z5','z6'])
Bear in mind that this will take up quite a bit of memory. Excel also caps a worksheet at 1,048,576 rows, so to_excel cannot hold 14 million rows in one sheet anyway.
Also, don't shadow built-in names with your variables: all is a built-in Python function.
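If you do need all 13,983,816 rows on disk, one sketch (with a hypothetical output path 'combinations.csv') is to stream them to CSV instead of materializing a DataFrame, which keeps memory flat because itertools.combinations yields one tuple at a time:
import csv
import itertools

with open('combinations.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['z1', 'z2', 'z3', 'z4', 'z5', 'z6'])
    writer.writerows(itertools.combinations(range(1, 50), 6))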

Create rolling window from pandas dataframe

I am playing with time series data, and I want to train a model to predict future outcomes. I have some data shaped like the following:
      Date  Failures
0  2021-06        10
1  2021-05         2
2  2021-04         7
3  2021-03         9
4  2021-02         3
...
I would like to shape this data (not necessarily as a pandas df) as a rolling window with four entries:
10 2 7 9 3
...
and then the fifth entry being the number I want to predict. I have read on Stack Exchange that one should avoid iterating over a pandas DataFrame, so what would be the appropriate way to transform my dataframe? I have heard of the .rolling method; however, it does not seem to achieve what I want.
IIUC, you want to reshape your column to shape (49, 5) when it has an initial length of 245. You can use the underlying numpy array:
df['Failures'].values.reshape(-1,5)
Output (dummy numbers):
array([[  0,   1,   2,   3,   4],
       [  5,   6,   7,   8,   9],
       [ 10,  11,  12,  13,  14],
       [ 15,  16,  17,  18,  19],
       ...
       [235, 236, 237, 238, 239],
       [240, 241, 242, 243, 244]])
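Note that reshape produces non-overlapping blocks of five. If you want overlapping windows (each row shifted by one step), a sketch using numpy's sliding_window_view (NumPy 1.20+) would be:
import numpy as np

vals = df['Failures'].to_numpy()
windows = np.lib.stride_tricks.sliding_window_view(vals, 5)
X, y = windows[:, :4], windows[:, 4]  # four inputs, fifth value as the target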

Cumulative addition in a loop

I am trying to cumulatively add a value to the previous value and, each time, store the result in an array.
This code is just part of a larger project. For simplicity, I am going to define my variables as follows:
ele_ini = [12]
smb = [2, 5, 7, 8, 9, 10]

val = ele_ini
for i in range(len(smb)):
    val += smb[i]
    print(val)
    elevation_smb.append(val)
Problem
Each time, the previous value stored in elevation_smb is replaced by the current value, such that the result I obtain is:
elevation_smb = [22, 22, 22, 22, 22, 22]
The result I am expecting, however, is
elevation_smb = [14, 19, 26, 34, 43, 53]
NOTE:
ele_ini is a vector with n elements. I am only using 1 element just for simplicity.
Don't use loops; they are slow. The fast vectorized solution below is better.
I think you need numpy.cumsum, adding the vector ele_ini to get a 2d numpy array:
import numpy as np

ele_ini = [12, 10, 1, 0]
smb = [2, 5, 7, 8, 9, 10]

elevation_smb = np.cumsum(np.array(smb)) + np.array(ele_ini)[:, None]
print(elevation_smb)
[[14 19 26 34 43 53]
[12 17 24 32 41 51]
[ 3 8 15 23 32 42]
[ 2 7 14 22 31 41]]
In your code, val is appended by reference, not by value, so every element of elevation_smb ends up pointing at the same object. Append a copy instead, which snapshots the current value:
elevation_smb.append(val.copy())
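Put together, the fixed loop would look like this (a sketch assuming ele_ini is a NumPy array, as the NOTE suggests):
import numpy as np

ele_ini = np.array([12])
smb = [2, 5, 7, 8, 9, 10]

elevation_smb = []
val = ele_ini.copy()
for s in smb:
    val += s  # in-place add mutates the same array on every pass
    elevation_smb.append(val.copy())  # snapshot the current value
# elevation_smb: [array([14]), array([19]), array([26]), array([34]), array([43]), array([53])]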
Do it with reduce (in Python 3, import it from functools first):
In [6]: reduce(lambda c, x: c + [c[-1] + x], smb, ele_ini)
Out[6]: [12, 14, 19, 26, 34, 43, 53]
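A very similar one-liner is itertools.accumulate (the initial keyword needs Python 3.8+):
from itertools import accumulate

list(accumulate(smb, initial=ele_ini[0]))
# [12, 14, 19, 26, 34, 43, 53]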
