Python pandas dataframe filling, e.g. bfill, ffill

I have two problems with filling out a very large dataframe. Here is a section of it as a picture. I want the 1000 in E and F to be pulled down to 26 and no further. In the same way I want the 2000 to be pulled up to -1 and down to the next 26. I thought I could do this with bfill and ffill, but unfortunately I don't know how... (picture 1)
Another problem is that there are sections in which the rows from -1 to 26 contain no values at all in E and F. How can I delete them or fill them with 0 so that bfill or ffill doesn't make wrong entries there? (picture 2)
import pandas as pd
import numpy as np

data = '/Users/Hanna/Desktop/Coding/Code.csv'
df_1 = pd.read_csv(data, usecols=["A", "B", "C", "D", "E", "F"], nrows=75)

base_list = list(range(-1, 27))  # -1, 0, 1, ..., 26
df_c = pd.MultiIndex.from_product([
    [4000074],
    ["SP000796746", "SP001811642"],
    [201824, 201828, 201832, 201835, 201837, 201839, 201845, 201850,
     201910, 201918, 201922, 201926, 201909, 201916, 201918, 201920],
    base_list],
    names=["A", "B", "C", "D"]).to_frame(index=False)
df_3 = pd.merge(df_c, df_1, how='outer')
To make it easier to understand, I have shortened the example a bit. Picture 3 shows what it looks like when it is filled, and picture 4 shows it correctly filled.
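A minimal sketch of one way this could be expressed with groupby (assuming df_3 from the code above is sorted so that each segment in D runs from -1 to 26):
# A new segment starts at every -1 in D
seg = (df_3['D'] == -1).cumsum()
# Fill forward and backward only within each segment,
# then turn completely empty segments into 0
df_3[['E', 'F']] = (df_3.groupby(seg)[['E', 'F']]
                        .transform(lambda s: s.ffill().bfill())
                        .fillna(0))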

Assuming you have to find and fill values for a particular segment:
data = pd.read_csv('/Users/Hanna/Desktop/Coding/Code.csv')
for i in range(0, data.shape[0], 27):
    # .loc slicing is inclusive on both ends, so stop at i + 26 to avoid
    # writing into the first row of the next segment; Series.max() skips
    # NaN, unlike the builtin max
    if i + 27 < data.shape[0]:
        data.loc[i:i + 26, 'E'] = data['E'].iloc[i:i + 27].max()
    else:
        data.loc[i:data.shape[0], 'E'] = data['E'].iloc[i:data.shape[0]].max()
You can replace max with whatever aggregation you want.

You could find the indexes where you have -1 and then slice/loop over the columns to fill.
Just to create the sample data:
import pandas as pd
df = pd.DataFrame(columns=list('ABE'))
df['A'] = list(range(-1, 26)) * 10
Add random values at each section:
import random
for i in df.index:
    if i % 27 == 0:
        df.loc[i, 'B'] = random.random()
    else:
        df.loc[i, 'B'] = 0
Find the indexes to slice over:
indx = df[df['A'] == -1].index.values
Fill out the data in column "E":
for i, j in zip(indx[:-1], indx[1:]):
    df.loc[i:j - 1, 'E'] = df.loc[i:j - 1, 'B'].max()
    if j == indx[-1]:
        df.loc[j:, 'E'] = df.loc[j:, 'B'].max()
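If a whole section has no values in B at all, its max is NaN and E stays empty there; per the second part of the question, a trailing fill (a one-line sketch) would turn those sections into 0:
df['E'] = df['E'].fillna(0)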

Related

Consolidation of consecutive rows by condition with Python Pandas

I'm trying to handle the following data issue. I have a dataframe of values and their label lists (this is multi-class, so the labels field is a list).
The dataframe looks like:
      | value | labels
-----------------------------------
row_1 | A     | [label1]
row_2 | B     | [label2]
row_3 | C     | [label3, label4]
row_4 | D     | [label4, label5]
I want to find all rows that have a specific label and then:
Firstly, concatenate its value with the next row's value - the string goes before the next row's value.
Secondly, append its labels to the beginning of the next row's label list.
For example, if I want to do that for label2, the desired output will be:
      | value | labels
-----------------------------------
row_1 | A     | [label1]
row_3 | BC    | [label2, label3, label4]
row_4 | D     | [label4, label5]
The value "B" is joined before the next row's value, and the label "label2" is prepended to the next row's label list. The indexes are not relevant for me.
I would greatly appreciate help with this. I tried to use merge, join, shift, and cumsum, but without success so far.
The following code creates the data in the example:
data = {'row_1': ["A", ["label1"]], 'row_2': ["B", ["label2"]],
        'row_3': ["C", ["label3", "label4"]], 'row_4': ["D", ["label4", "label5"]]}
df = pd.DataFrame.from_dict(data, orient='index').rename(columns={0: "value", 1: "labels"})
You could create a grouping variable and use that to aggregate the columns:
import pandas as pd
import numpy as np

def my_combine(data, value):
    # Rows whose label list contains the target value
    index = data['labels'].apply(lambda x: np.isin(value, x))
    if all(~index):
        return data
    # Mark the matching rows and the rows immediately after them
    idx = (index | index.shift()).to_numpy()
    # Build group ids so that each marked row merges with the row that follows it
    vals = (np.arange(idx.size) + 1) * (~idx)
    gr = np.r_[np.where(vals[1:] != vals[:-1])[0], vals.size - 1]
    groups = np.repeat(gr, np.diff(np.r_[-1, gr]))
    return data.groupby(groups).agg(sum)

my_combine(df, 'label2')
  value                    labels
0     A                  [label1]
2    BC  [label2, label3, label4]
3     D          [label4, label5]
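For what it's worth, agg(sum) merges each group in one pass because + concatenates both strings ("B" + "C" gives "BC") and lists ([label2] + [label3, label4] gives [label2, label3, label4]).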

Split a dataframe into two dataframes using the first column's string values in Python

I have two .txt files where I want to split the data frame into two parts using the first column's value. If the value is less than "H1000" it should go into the first dataframe, and if it is greater than or equal to "H1000" it should go into the second. The first column's values start with H followed by four digits; I want to ignore the H when comparing the numbers in Python.
Here is what I have tried, but it is not working:
ht_data = all_dfs.index[all_dfs.iloc[:, 0] == "H1000"][0]
print(ht_data)
Here is my code:
if (".txt" in str(path_txt).lower()) and path_txt.is_file():
txt_files = [Path(path_txt)]
else:
txt_files = list(Path(path_txt).glob("*.txt"))
for fn in txt_files:
all_dfs = pd.read_csv(fn,sep="\t", header=None) #Reading file
all_dfs = all_dfs.dropna(axis=1, how='all') #Drop the columns where all columns are NaN
all_dfs = all_dfs.dropna(axis=0, how='all') #Drop the rows where all columns are NaN
print(all_dfs)
ht_data = all_dfs.index[all_dfs.iloc[:, 0] == "H1000"][0]
print(ht_data)
df_h = all_dfs[0:ht_data] # Head Data
df_t = all_dfs[ht_data:] # Tene Data
Can anyone help me achieve this task in Python?
Assuming this data:
import pandas as pd

data = pd.DataFrame(
    [
        ["H0002", "Version", "5"],
        ["H0003", "Date_generated", "8-Aug-11"],
        ["H0004", "Reporting_period_end_date", "19-Jun-11"],
        ["H0005", "State", "AW"],
        ["H1000", "Tene_no/Combined_rept_no", "E75/3794"],
        ["H1001", "Tenem_holder Magnetic Resources", "NL"],
    ],
    columns=["id", "col1", "col2"],
)
We can create a mask for values over and under a preset threshold, like 1000:
mask = data["id"].str.strip("H").astype(int) < 1000
df_h = data[mask]
df_t = data[~mask]
If you want to compare values of the format val = HXXXX, where each X is a digit represented as a character, try this:
val = 'H1003'
val_cmp = int(val[1:])  # drop the leading 'H'
if val_cmp < 1000:
    pass  # goes into the first dataframe
else:
    pass  # goes into the second dataframe
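Applied to the whole id column at once, the same comparison reproduces the mask above (a sketch against the sample data):
val_cmp = data["id"].str[1:].astype(int)
df_h = data[val_cmp < 1000]
df_t = data[val_cmp >= 1000]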

Applying a missing value distribution of a Dataframe to a subset of the Dataframe: Needs to be faster

I have a large pandas Dataframe (20k rows). Mocking up some data:
import numpy as np
import pandas as pd

columns = [chr(i) for i in range(ord('a'), ord('z') + 1)]
df = pd.DataFrame(np.random.randint(0, 100, size=(20000, 26)), columns=columns)
for col in df.columns:
    df.loc[df.sample(frac=0.4).index, col] = np.nan
In this Dataframe some of the columns might contain missing values indicated with NaN. I need to filter out missing values for a single column and return a new Dataframe that has no missing values in that column:
col_name = "a"
dframe = df.copy()
df_col = dframe[~dframe[col_name].isnull()]
Now df_col may still have missing values in the other columns of the subset. But what I have lost is the information about which missing values co-occurred with the ones I filtered out. So if col_name is "A", it might be that "D" is usually missing when "A" is missing; now it appears that "D" is always present in df_col.
I want to take the missingness distribution from dframe and randomly sample from df_col to simulate the missing values. By missingness distribution I mean the combinations of column names that have NaN values, together with their proportions:
{
    ("A", "E", "G"): 0.24,
    ("G", "Z"): 0.01,
    ("G",): 0.32,
    ...,
    ("R", "M"): 0.09
}
I have functions that do this, but they are too slow for my needs:
from typing import Dict
import pandas as pd
import numpy as np

def get_freq_dict(df: pd.DataFrame) -> Dict[tuple, float]:
    num_samples = df.shape[0]
    col_names = df.columns
    # For each row, collect the names of the columns that are NaN
    list_of_lists = df.apply(lambda row: [i for i in col_names if np.isnan(row[i])], axis=1).tolist()
    output = {}
    for lis in list_of_lists:
        output.setdefault(tuple(lis), list()).append(1)
    for a, b in output.items():
        output[a] = sum(b) / float(num_samples)
    return output

def add_in_missingness(df_col, freq_dist) -> pd.DataFrame:
    sample_list = []
    df_m = df_col.copy()
    for key in freq_dist:
        sample = df_m.sample(frac=freq_dist[key], replace=False, random_state=1)
        # Remove the sampled indexes from the remaining pool
        blacklist = list(sample.index)
        df_m = df_m[~df_m.index.isin(blacklist)]
        # Blank out the columns named in this missingness pattern
        col_names = list(key)
        sample[col_names] = np.nan
        sample_list.append(sample)
    df_col = pd.concat(sample_list)
    return df_col
Running it:
%%time
freq_dist = get_freq_dict(df)
df_ = add_in_missingness(df_col, freq_dist)
It seems to work, but it takes far too long for my purposes:
CPU times: user 1min 1s, sys: 439 ms, total: 1min 1s
Wall time: 1min 1s
I need help making these functions efficient. Any ideas?
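One direction that might help (a sketch, not a tested drop-in; the name get_freq_dict_fast is just for illustration): the per-row apply in get_freq_dict dominates the cost, and the same frequency table can be built from the boolean NaN mask with a single value_counts (DataFrame.value_counts requires pandas >= 1.1):
import numpy as np
import pandas as pd

def get_freq_dict_fast(df: pd.DataFrame) -> dict:
    mask = df.isna()
    # Count each distinct row-pattern of NaNs in one vectorized pass
    counts = mask.value_counts(normalize=True)
    cols = np.asarray(df.columns)
    # Translate each boolean pattern back into a tuple of column names
    return {tuple(cols[np.array(key)]): freq for key, freq in counts.items()}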

How to assign a value to a column in Dask data frame

How can I do the same as the code below for a Dask dataframe?
df['new_column'] = 0
for i in range(len(df)):
    if condition:
        df.loc[i, 'new_column'] = '1'
    else:
        df.loc[i, 'new_column'] = '0'
I want to add a new column to a dask dataframe and insert 0/1 to the new column.
In case you do not wish to compute as suggested by Rajnish kumar, you can also use something along the following lines:
import dask.dataframe as dd
import pandas as pd

my_df = [{"a": 1, "b": 2}, {"a": 2, "b": 3}]
df = pd.DataFrame(my_df)
dask_df = dd.from_pandas(df, npartitions=2)
dask_df["c"] = dask_df.apply(lambda x: x["a"] < 2,
                             axis=1,
                             meta=pd.Series(name="c", dtype=bool))
dask_df.compute()
Output:
   a  b      c
0  1  2   True
1  2  3  False
The condition (here a check whether the entry in column "a" < 2) is applied on a row-by-row basis. Note that depending on your condition and the dependencies therein it might not necessarily be as straightforward, but in that case you could share additional information on what your condition entails.
You can't do that directly on a Dask DataFrame; you first need to compute it. Use this, it will work:
df = df.compute()
for i in range(len(df)):
    if condition:
        df.loc[i, 'new_column'] = '1'
    else:
        df.loc[i, 'new_column'] = '0'
The reason behind this is that a Dask DataFrame is a lazy representation of the dataframe's schema, divided into dask-delayed tasks. Hope it helps you.
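For completeness, if the condition can be expressed as a column expression (assumed here to be the same "a" < 2 check used in the example above), a plain vectorized assignment avoids both compute() and the Python loop (a sketch):
dask_df["new_column"] = (dask_df["a"] < 2).astype(int).astype(str)
dask_df.compute()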
I was going through these answers for a similar problem I was facing.
This worked for me.
def extractAndFill(df, datetimeColumnName):
    # Add 4 new columns for weekday, hour, month and year
    df['pickup_date_weekday'] = 0
    df['pickup_date_hour'] = 0
    df['pickup_date_month'] = 0
    df['pickup_date_year'] = 0
    # Iterate through each row and update the values for weekday, hour, month and year
    for index, row in df.iterrows():
        # Get weekday, hour, month and year (extractDateParts is defined elsewhere)
        w, h, m, y = extractDateParts(row[datetimeColumnName])
        # Write back through .loc - assigning to the iterrows row would be
        # lost, because the row is a copy
        df.loc[index, 'pickup_date_weekday'] = w
        df.loc[index, 'pickup_date_hour'] = h
        df.loc[index, 'pickup_date_month'] = m
        df.loc[index, 'pickup_date_year'] = y
    return df

df1 = df1.compute()  # compute() returns a pandas DataFrame; keep the result
df1 = extractAndFill(df1, 'pickup_datetime')
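A possible alternative (a sketch, assuming the column parses as datetimes): with Dask, the date parts can usually be extracted without iterating at all, via the .dt accessor, which also keeps the computation lazy:
import dask.dataframe as dd

df1['pickup_datetime'] = dd.to_datetime(df1['pickup_datetime'])
df1['pickup_date_weekday'] = df1['pickup_datetime'].dt.weekday
df1['pickup_date_hour'] = df1['pickup_datetime'].dt.hour
df1['pickup_date_month'] = df1['pickup_datetime'].dt.month
df1['pickup_date_year'] = df1['pickup_datetime'].dt.year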

How can I select coefficient data of 3 columns from a CSV file

I would like to plot a number of columns for 2 different scenarios, selected by index from my dataset, preferably via pandas.DataFrame:
1st scenario: column indexes [2, 5, 8, ..., n+2]
2nd scenario: the last 480 columns, or column indexes [961-1439]
(picture)
I've tried to play with the column indexes as follows:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

dft = pd.read_csv(r"D:\Test.csv", header=None)
dft.head()
id_set = dft[dft.index % 2 == 0].astype('int').values
A = dft[dft.index % 2 == 1].values
B = dft[dft.index % 2 == 2].values
C = dft[dft.index % 2 == 3].values
data = {'A': A[:, 0], 'B': B[:, 0], 'C': C[:, 0]}
df = pd.DataFrame(data, columns=['A', 'B', 'C'], index=id_set[:, 0])
#1st scenario
j=0
index=[]
for i in range(1439):
if j==2:
j=0
continue
else:
index.append(i)
j+=1
print(index)
#2nd scenario
last_480 = df.[0:480][::-1]
I've found this post1 and post2, but they don't cover my case!
I would appreciate it if someone could help me.
1st scenario:
df.iloc[:, 2::3]
The slicing here means all rows, and columns starting from index 2, taking every 3rd column after that.
2nd scenario:
df.iloc[:, :961:-1]
The slicing here means all rows, and columns counted backwards from the end of the frame down to (but not including) column 961.
EDIT:
import matplotlib.pyplot as plt
import seaborn as sns

scenario1 = df.iloc[:, 2::3].copy()
sns.lineplot(data=scenario1.T)
You can save a copy of the slice to another variable; then, since you want to graph row-wise, you need to take the transpose of the sliced matrix (this will turn your rows into columns).
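If the goal is simply the last 480 columns in their natural order, negative slicing is a simpler sketch:
last_480 = df.iloc[:, -480:]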
