How can I select data from every 3rd column of a CSV file - Python

I would like to plot a number of columns for 2 different scenarios, based on column index, from my dataset, preferably via pandas.DataFrame:
1st scenario: columns with index [2, 5, 8, ..., n+2]
2nd scenario: the last 480 columns, i.e. column index [961-1439]
I've tried to play with the column index as follows:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

dft = pd.read_csv(r"D:\Test.csv", header=None)  # raw string so the backslash is not treated as an escape
dft.head()

# rows repeat in blocks of 4: id, A, B, C (note: dft.index % 2 can never equal 2 or 3)
id_set = dft[dft.index % 4 == 0].astype('int').values
A = dft[dft.index % 4 == 1].values
B = dft[dft.index % 4 == 2].values
C = dft[dft.index % 4 == 3].values
data = {'A': A[:, 0], 'B': B[:, 0], 'C': C[:, 0]}
df = pd.DataFrame(data, columns=['A', 'B', 'C'], index=id_set[:, 0])

# 1st scenario: collect column indexes, skipping every 3rd one
j = 0
index = []
for i in range(1439):
    if j == 2:
        j = 0
        continue
    else:
        index.append(i)
        j += 1
print(index)

# 2nd scenario (this slices rows, not columns)
last_480 = df[0:480][::-1]
I've found this post1 and post2, but they didn't fit my case!
I would appreciate it if someone could help me.

1st scenario:
df.iloc[:, 2::3]
The slicing here means all rows, and columns starting from index 2 (the third column), taking every 3rd column after that.
2nd scenario:
df.iloc[:, -480:]
The slicing here means all rows, and the last 480 columns, in their original order.
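As a quick sanity check, here is a minimal sketch on a small dummy frame (12 columns standing in for the real 1440-column data; the numbers are made up for illustration):
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(24).reshape(2, 12))

scenario1 = df.iloc[:, 2::3]  # columns 2, 5, 8, 11
scenario2 = df.iloc[:, -4:]   # the last 4 columns (use -480 on the real data)
print(scenario1.columns.tolist())  # [2, 5, 8, 11]
print(scenario2.columns.tolist())  # [8, 9, 10, 11]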
EDIT:
import matplotlib.pyplot as plt
import seaborn as sns
scenario1 = df.iloc[:, 2::3].copy()
sns.lineplot(data=scenario1.T)
You can save a copy of the slice to another variable; then, since you want to graph row-wise, take the transpose of the sliced frame (this turns your rows into columns, so each original row is plotted as its own line).

Related

How to insert a new column into a dataframe and access rows with different indices?

I have a dataframe with one column "Numbers", and I want to add a second column "Result". Each value should be the sum of the previous two values in the "Numbers" column, otherwise NaN.
import pandas as pd
import numpy as np

data = {
    "Numbers": [100, 200, 400, 0]
}
df = pd.DataFrame(data, index=["whatever1", "whatever2", "whatever3", "whatever4"])

def add_prev_two_elems_to_DF(df):
    numbers = "Numbers"  # alias
    result = "Result"    # alias
    df[result] = np.nan  # empty column
    result_index = list(df.columns).index(result)
    for i in range(len(df)):
        if i < 2:
            df.iloc[i, result_index] = np.nan
        else:
            df.iloc[i, result_index] = df.iloc[i - 1][numbers] + df.iloc[i - 2][numbers]

add_prev_two_elems_to_DF(df)
display(df)
The output is:
           Numbers  Result
whatever1      100     NaN
whatever2      200     NaN
whatever3      400   300.0
whatever4        0   600.0
But this looks quite complicated. Can this be done more easily, and maybe faster? I am not looking for a solution with sum(); I want a general solution for any kind of function that can fill a column using values from other rows.
Edit 1: I forgot to import numpy.
Edit 2: I changed one line to this:
if i < 2: df.iloc[i,result_index] = np.nan
Looks like you could use rolling.sum together with shift. Since rolling.sum sums up to and including the current row, we have to shift the result down one row so that each row's value matches the sum of the previous 2 rows:
df['Result'] = df['Numbers'].rolling(2).sum().shift()
Output:
           Numbers  Result
whatever1      100     NaN
whatever2      200     NaN
whatever3      400   300.0
whatever4        0   600.0
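Since the question asks for a general recipe for any function of other rows, note (my addition) that shift alone already covers many such cases; a minimal sketch on the same df:
# each row's Result = f(previous row, row before that); here f is addition
df['Result'] = df['Numbers'].shift(1) + df['Numbers'].shift(2)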
This is the shortest code I could develop. It outputs exactly the same table.
import numpy as np
import pandas as pd
#import swifter  # apply() gets swifter

data = {
    "Numbers": [100, 200, 400, 0]
}
df = pd.DataFrame(data, index=["whatever1", "whatever2", "whatever3", "whatever4"])

def func(a: pd.Series) -> float:  # rolling.apply passes a Series by default; we expect 3 elements, but we don't check that
    a.reset_index(inplace=True, drop=True)  # the index now starts with 0, 1, ...
    return a[0] + a[1]  # we use the first two elements; the 3rd is unnecessary

df["Result"] = df["Numbers"].rolling(3).apply(func)
#df["Result"] = df["Numbers"].swifter.rolling(3).apply(func)
display(df)
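As a side note (my addition, not part of the original answer): passing raw=True hands each window to the function as a plain NumPy array, skipping the per-window Series construction, which is usually faster:
df["Result"] = df["Numbers"].rolling(3).apply(lambda a: a[0] + a[1], raw=True)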

How to do a multiplication of two different columns and rows

How can I reproduce in Python this calculation that I made in Excel?
I want to take the previous row's 'Acumulado' value, multiply it by the current row's 'Selic por dia', store the result in the current row, and repeat this successively down the column.
import pandas as pd

# Creating the dataframe
df = pd.DataFrame({"Data": ['06/03/2006', '07/03/2006', '08/03/2006', '09/03/2006', '10/03/2006',
                            '13/03/2006', '14/03/2006', '15/03/2006', '16/03/2006', '17/03/2006'],
                   "Taxa SELIC": [17.29, 17.29, 17.29, 16.54, 16.54, 16.54, 16.54, 16.54, 16.54, 16.54]})
df['Taxa Selic %'] = df['Taxa SELIC'] / 100
df['Selic por dia'] = (1 + df['Taxa Selic %'])**(1/252)  # daily factor from the annual rate (as a decimal, not percentage points)
[Screenshots in the original post: the dataframe, the calculation done in Excel, and the desired result]
Not an efficient method, but you can try this:
import numpy as np

selic_per_dia = list(df['Selic por dia'].values)
accumulado = [1000000 * selic_per_dia[0]]
for i, value in enumerate(selic_per_dia):
    if i == 0:
        continue
    else:
        accumulado.append(accumulado[i - 1] * value)
df['Acumulado'] = accumulado

# prepend a starting row holding the initial 1,000,000
df.loc[-1] = [np.nan, np.nan, np.nan, np.nan, 1000000]
df.index = df.index + 1
df = df.sort_index()
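A vectorized alternative (my suggestion, not from the original answer) replaces the loop with cumprod, since each row is just the initial amount times the running product of the daily factors; run it before prepending the initial row:
# running product of the daily factors times the initial amount
df['Acumulado'] = 1000000 * df['Selic por dia'].cumprod()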

Create a heatmap with Pandas/Seaborn when one column is a list of lists

I'm trying to create a heatmap from this data, and there are a few problems I can't solve. On the x-axis I want the Location, and on the y-axis the Passengers. Those axes should not have duplicates; for the x-axis (Location) it's easy to use drop_duplicates(), but for the y-axis (Passengers) it doesn't work that well. The main problem is that the Passengers column can have multiple entries in a cell. Is there a good way to solve this?
Edit: I also need to get rid of the empty cells.
import numpy as np
from pandas import DataFrame
import seaborn as sns
import pandas as pd
from collections.abc import Iterable
%matplotlib inline

file = "vacation.csv"
df = pd.read_csv(file)
example = df.filter(['Location', 'Passengers'])
print(example)
#x_axis = df.filter(['Location']).drop_duplicates()  # drop duplicates
Output:
Location Passengers
0 Paris []
1 Paris []
2 Stockholm []
3 Berlin ['Peter']
4 Berlin ['Maria, Debra, Kim']
... ... ...
2238 Helsinki ['Peter, Maria']
2239 Berlin ['Debra']
2240 Berlin ['Debra']
2241 Helsinki ['Debra']
2242 Paris ['Peter', 'Debra', 'Kim', 'Maria']
[2243 rows x 2 columns]
You can convert the lists to columns as follows, but check whether it's valid for your case.
import pandas as pd
import numpy as np

def keep_one(row):
    # deduplicate while preserving order (dict keys keep insertion order)
    unique = {}
    for val in row:
        unique[val] = None
    return list(unique.keys())

# 'passengers_col' stands for your Passengers column
df['passengers_col'] = df['passengers_col'].apply(keep_one)
keys = np.unique(df['passengers_col'].apply(pd.Series).dropna()).astype('str').tolist()
cols_val = df['passengers_col'].apply(pd.Series).to_numpy().tolist()
new_cols = pd.DataFrame(data=cols_val, columns=keys)

# encode values: keep a passenger's name only in its own column, zero everywhere else
for k in new_cols.keys():
    new_cols.loc[new_cols[k] != k, k] = 0
Then you can use pd.concat to merge the new_cols DataFrame back with the original columns.
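For instance (a minimal sketch; axis=1 concatenates column-wise):
df = pd.concat([df, new_cols], axis=1)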
I'm not sure if I understood correctly, but maybe this approach could help you:
import pandas as pd
import numpy as np
import seaborn as sns
data = pd.DataFrame([["Paris", []], ["Paris", []],
                     ["Stockholm", []], ["Berlin", ['Peter']],
                     ["Berlin", ['Maria', 'Debra', 'Kim']], ["Helsinki", ['Peter, Maria']],
                     ["Berlin", ['Debra']], ["Berlin", ['Debra']],
                     ["Helsinki", ['Debra']], ["Paris", ['Peter', 'Debra', 'Kim', 'Maria']]],
                    columns=["Location", "Passengers"])  # a list, not a set: the column order must be fixed
data = data.groupby(["Location"]).sum()  # summing lists concatenates them per location
cols = np.unique(np.sum(data["Passengers"]))  # all distinct passenger names
for col in cols:
    data[col] = 0
for idx in data.index:
    for col in data.loc[idx, "Passengers"]:
        data.loc[idx, col] += 1
sns.heatmap(data.iloc[:, 1:])
Probably you could improve performance by removing loops if your dataset is big.
It outputs the following: [heatmap of passenger counts per location]
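If performance matters on the full 2243-row dataset, a loop-free variant (my sketch, assuming the Passengers cells are real Python lists of names, and with data_raw standing in for the original two-column frame before grouping) could use explode and crosstab:
# one row per (Location, passenger) pair; rows from empty lists become NaN and are dropped
pairs = data_raw.explode("Passengers").dropna(subset=["Passengers"])
counts = pd.crosstab(pairs["Location"], pairs["Passengers"])
sns.heatmap(counts)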

Capitalize random rows in Panda Dataframe

I'm making a reverse denoising autoencoder and I have a dataset, but it's all lowercase. I want 80% of the source entries to be capitalized and only 60% of the target entries to be capitalized. I wrote this:
import pandas as pd
import torch
df = pd.read_csv('Data/fb_moe.csv')
for i in range(len(df)):
    sample = int(torch.distributions.Bernoulli(torch.FloatTensor([.8])).sample())
    if sample == 1:
        df.iloc[i].y = str(df.iloc[i].y).capitalize()
    sample_1 = int(torch.distributions.Bernoulli(torch.FloatTensor([.6])).sample())
    if sample_1 == 1:
        df.iloc[i].x = str(df.iloc[i].x).capitalize()
df.to_csv('Data/fb_moe2.csv')
But this is pretty slow because my CSV has about 8 million rows. Is there a faster way to do this?
Part of the Dataframe
x,y
jon,jun
an,jun
ju,jun
jin,jun
nun,jun
un,jun
jon,jun
jin,jun
nen,jun
ju,jun
jn,jun
jul,jun
jen,jun
hun,jun
ju,jun
hun,jun
hun,jun
jon,jun
jin,jun
un,jun
eun,jun
jhn,jun
Try using a boolean mask and vectorized string methods; pandas does not behave quickly in Python-level for loops:
import numpy as np

n = len(df)
source = np.random.binomial(1, p=.8, size=n) == 1
target = source.copy()
total_source_true = np.sum(source)
# re-draw the target mask only where the source mask is True
target[source] = np.random.binomial(1, p=.6, size=total_source_true) == 1
df.loc[source, 'x'] = df.loc[source, 'x'].str.capitalize()
df.loc[target, 'y'] = df.loc[target, 'y'].str.capitalize()
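Note that this couples the two masks (a row's y can only be capitalized if its x is). If, as in the original loop, the two draws should be independent, a sketch with two independent masks (y with probability .8 and x with .6, matching the question's code):
y_mask = np.random.rand(n) < 0.8  # capitalize y with probability 0.8
x_mask = np.random.rand(n) < 0.6  # capitalize x with probability 0.6
df.loc[x_mask, 'x'] = df.loc[x_mask, 'x'].str.capitalize()
df.loc[y_mask, 'y'] = df.loc[y_mask, 'y'].str.capitalize()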

How to assign a value to a column in Dask data frame

How can I do the same as the below code for a Dask dataframe?
df['new_column'] = 0
for i in range(len(df)):
    if (condition):
        df.loc[i, 'new_column'] = '1'
    else:
        df.loc[i, 'new_column'] = '0'
I want to add a new column to a dask dataframe and insert 0/1 to the new column.
In case you do not wish to compute as suggested by Rajnish kumar, you can also use something along the following lines:
import dask.dataframe as dd
import pandas as pd
import numpy as np
my_df = [{"a": 1, "b": 2}, {"a": 2, "b": 3}]
df = pd.DataFrame(my_df)
dask_df = dd.from_pandas(df, npartitions=2)
dask_df["c"] = dask_df.apply(lambda x: x["a"] < 2,
axis=1,
meta=pd.Series(name="c", dtype=np.bool))
dask_df.compute()
Output:
   a  b      c
0  1  2   True
1  2  3  False
The condition (here, a check whether the entry in column "a" is less than 2) is applied row by row. Note that, depending on your condition and its dependencies, it might not be quite as straightforward; in that case you could share additional information on what your condition entails.
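If the condition is a simple column expression, you can also skip apply entirely (my addition; Dask supports vectorized comparisons and keeps the computation lazy):
dask_df["c"] = dask_df["a"] < 2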
You can't do that directly on a Dask DataFrame. You first need to compute it. Use this; it will work:
df = df.compute()
for i in range(len(df)):
    if (condition):
        df.loc[i, 'new_column'] = '1'
    else:
        df.loc[i, 'new_column'] = '0'
The reason behind this is that a Dask DataFrame is a lazy representation of the dataframe, split into dask-delayed tasks. Hope it helps you.
I was going through these answers for a similar problem I was facing.
This worked for me.
def extractAndFill(df, datetimeColumnName):
    # Add 4 new columns for weekday, hour, month and year
    df['pickup_date_weekday'] = 0
    df['pickup_date_hour'] = 0
    df['pickup_date_month'] = 0
    df['pickup_date_year'] = 0
    # Iterate through each row and update the values for weekday, hour, month and year
    for index, row in df.iterrows():
        # Get weekday, hour, month and year
        w, h, m, y = extractDateParts(row[datetimeColumnName])
        # Update the dataframe itself; assigning to `row` would only change a copy
        df.loc[index, 'pickup_date_weekday'] = w
        df.loc[index, 'pickup_date_hour'] = h
        df.loc[index, 'pickup_date_month'] = m
        df.loc[index, 'pickup_date_year'] = y
    return df

df1 = df1.compute()  # materialize the Dask frame as pandas first (the result must be assigned)
df1 = extractAndFill(df1, 'pickup_datetime')
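A loop-free alternative (my sketch, not from the answer) stays in Dask and uses the .dt accessor, assuming 'pickup_datetime' can be parsed as a datetime:
import dask.dataframe as dd

df1['pickup_datetime'] = dd.to_datetime(df1['pickup_datetime'])
df1['pickup_date_weekday'] = df1['pickup_datetime'].dt.weekday
df1['pickup_date_hour'] = df1['pickup_datetime'].dt.hour
df1['pickup_date_month'] = df1['pickup_datetime'].dt.month
df1['pickup_date_year'] = df1['pickup_datetime'].dt.year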
