Change variable on the basis of last character - python

I need to change the text variable in the dataset below. Each row holds a categorical value (object dtype) that needs to be changed depending on its last character. Below you can see my dataset.
import pandas as pd
import numpy as np
data = {
'stores': ['Lexinton1','ROYAl2','Mall1','Mall2','Levis1','Levis2','Shark1','Shark','Lexinton'],
'quantity':[1,1,1,1,1,1,1,1,1]
}
df = pd.DataFrame(data, columns=['stores', 'quantity'])
df
Now I want to change this data depending on the last character. For example, if the last character is the number 1, I want to append the word open; if it is the number 2, I want to append closed. If it is not a number, I don't change anything and the text stays the same. Below you can see the desired output.

You can approach this by using pandas.Series.str and pandas.Series.map.
dmap = {1: "_open", 2: "_close"}

suffix = pd.to_numeric(df["stores"].str[-1], errors="coerce").map(dmap).fillna("")

# only rewrite the rows that actually got a suffix, so names that do not
# end in a digit are left untouched
df["stores"] = df["stores"].str[:-1].add(suffix).where(suffix.ne(""), df["stores"])

Or simply by using pandas.Series.replace:
df["stores"] = df["stores"].replace({"1$": "_open", "2$": "_close"}, regex=True)
Output:
print(df)
stores quantity
0 Lexinton_open 1
1 ROYAl_close 1
2 Mall_open 1
3 Mall_close 1
4 Levis_open 1
5 Levis_close 1
6 Shark_open 1
7 Shark 1
8 Lexinton 1

You can try this:
import pandas as pd
import numpy as np
data = {
    'stores': ['Lexinton1','ROYAl2','Mall1','Mall2','Levis1','Levis2','Shark1','Shark','Lexinton'],
    'quantity': [1,1,1,1,1,1,1,1,1]
}
for i in range(len(data['stores'])):
    if data['stores'][i][-1] == '1':
        data['stores'][i] = data['stores'][i][:-1] + '_open'
    elif data['stores'][i][-1] == '2':
        data['stores'][i] = data['stores'][i][:-1] + '_closed'
df = pd.DataFrame(data, columns=['stores', 'quantity'])
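The loop above can also be written without iteration. This is a sketch of my own (not from the original answers) using Series.mask with str.endswith, matching the '_open'/'_closed' suffixes used in this answer:

```python
import pandas as pd

df = pd.DataFrame({
    'stores': ['Lexinton1', 'ROYAl2', 'Mall1', 'Mall2', 'Levis1',
               'Levis2', 'Shark1', 'Shark', 'Lexinton'],
    'quantity': [1, 1, 1, 1, 1, 1, 1, 1, 1],
})

stores = df['stores']
# Rows ending in '1' get '_open', rows ending in '2' get '_closed';
# anything else (e.g. 'Shark', 'Lexinton') is left exactly as it was.
df['stores'] = (stores
                .mask(stores.str.endswith('1'), stores.str[:-1] + '_open')
                .mask(stores.str.endswith('2'), stores.str[:-1] + '_closed'))
print(df['stores'].tolist())
```

Both conditions are evaluated against the original column, so the two mask calls cannot interfere with each other.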

Related

How to insert a new column into a dataframe and access rows with different indices?

I have a dataframe with one column "Numbers" and I want to add a second column "Result". Each value should be the sum of the previous two values in the "Numbers" column, or NaN where fewer than two previous values exist.
import pandas as pd
import numpy as np
data = {
"Numbers": [100,200,400,0]
}
df = pd.DataFrame(data, index=["whatever1", "whatever2", "whatever3", "whatever4"])

def add_prev_two_elems_to_DF(df):
    numbers = "Numbers"  # alias
    result = "Result"  # alias
    df[result] = np.nan  # empty column
    result_index = list(df.columns).index(result)
    for i in range(len(df)):
        # row = df.iloc[i]
        if i < 2:
            df.iloc[i, result_index] = np.nan
        else:
            df.iloc[i, result_index] = df.iloc[i-1][numbers] + df.iloc[i-2][numbers]

add_prev_two_elems_to_DF(df)
display(df)
The output is:
Numbers Result
whatever1 100 NaN
whatever2 200 NaN
whatever3 400 300.0
whatever4 0 600.0
But this looks quite complicated. Can this be done easier and maybe faster? I am not looking for a solution with sum(). I want a general solution for any kind of function that can fill a column using values from other rows.
Edit 1: I forgot to import numpy.
Edit 2: I changed one line to this:
if i < 2: df.iloc[i,result_index] = np.nan
Looks like you could use rolling.sum together with shift. Since rolling.sum sums up to and including the current row, we shift the result down one row so that each row gets the sum of the previous two rows:
df['Result'] = df['Numbers'].rolling(2).sum().shift()
Output:
Numbers Result
whatever1 100 NaN
whatever2 200 NaN
whatever3 400 300.0
whatever4 0 600.0
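Since the question asks for a general pattern rather than a sum()-specific one, the same result can also be built from shifted copies of the column; any expression combining shift(1), shift(2), and so on works row-wise. A sketch with the same data:

```python
import pandas as pd

df = pd.DataFrame({"Numbers": [100, 200, 400, 0]},
                  index=["whatever1", "whatever2", "whatever3", "whatever4"])

# shift(1) is the previous row, shift(2) the row before that; the first two
# rows lack predecessors and come out as NaN automatically.
df["Result"] = df["Numbers"].shift(1) + df["Numbers"].shift(2)
print(df)
```

This generalizes to arbitrary functions of earlier rows, e.g. `df["Numbers"].shift(1) * df["Numbers"].shift(2)`.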
This is the shortest code I could develop. It outputs exactly the same table.
import numpy as np
import pandas as pd
#import swifter # apply() gets swifter
data = {
    "Numbers": [100,200,400,0]
}
df = pd.DataFrame(data, index=["whatever1", "whatever2", "whatever3", "whatever4"])

def func(a: pd.Series) -> float:  # we expect 3 elements, but we don't check that
    a.reset_index(inplace=True, drop=True)  # the index now starts with 0, 1, ...
    return a[0] + a[1]  # we use the first two elements, the 3rd is unnecessary

df["Result"] = df["Numbers"].rolling(3).apply(func)
# df["Result"] = df["Numbers"].swifter.rolling(3).apply(func)
display(df)

change the first occurrence in a pandas column based on certain condition

I would like to change the first character in the number column to "+233" when that first character is "0"; basically I want all rows in number to look like Paul's row.
Both columns are string objects.
Expectation:
The first character of each value in the column df["number"] should be replaced with "+233", but only if it is "0".
df = pd.DataFrame([["ken", "080222333222"],
                   ["ben", "+233948433"],
                   ["Paul", "0800000073"]],
                  columns=['name', 'number'])
Hope I understood your edit, try this:
Notice - I removed the first 0 and replaced it with +233
import pandas as pd

df = pd.DataFrame([["ken", "080222333222"], ["ben", "+233948433"], ["Paul", "0800000073"]], columns=['name', 'number'])

def convert_number(row):
    if row[0] == '0':
        row = "+233" + row[1:]
    return row

df['number'] = df['number'].apply(convert_number)
print(df)
You can use replace directly:
df['relace_Col'] = df.number.str.replace(r'^0', '+233', regex=True)
which produced
name number relace_Col
0 ken 080222333222 +23380222333222
1 ben +233948433 +233948433
2 Paul 0800000073 +233800000073
The full code to reproduce the above
import pandas as pd
df = pd.DataFrame([['ken', '080222333222'], ['ben', '+233948433'],
['Paul', '0800000073']], columns=['name', 'number'])
df['relace_Col'] = df.number.str.replace(r'^0', '+233', regex=True)
print(df)
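A vectorized alternative (my own sketch, not part of the answers above) that makes the "only if the first character is 0" condition explicit, using numpy.where with str.startswith:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([['ken', '080222333222'], ['ben', '+233948433'],
                   ['Paul', '0800000073']], columns=['name', 'number'])

# Replace the leading '0' with '+233' only where the number starts with '0';
# all other numbers (like ben's, which already has a prefix) pass through.
df['number'] = np.where(df['number'].str.startswith('0'),
                        '+233' + df['number'].str[1:],
                        df['number'])
print(df)
```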

How to assign a value to a column in Dask data frame

How to do the same as the below code for a Dask dataframe.
df['new_column'] = 0
for i in range(len(df)):
    if (condition):
        df.loc[i, 'new_column'] = '1'
    else:
        df.loc[i, 'new_column'] = '0'
I want to add a new column to a dask dataframe and insert 0/1 to the new column.
In case you do not wish to compute as suggested by Rajnish kumar, you can also use something along the following lines:
import dask.dataframe as dd
import pandas as pd
import numpy as np
my_df = [{"a": 1, "b": 2}, {"a": 2, "b": 3}]
df = pd.DataFrame(my_df)
dask_df = dd.from_pandas(df, npartitions=2)
dask_df["c"] = dask_df.apply(lambda x: x["a"] < 2,
                             axis=1,
                             meta=pd.Series(name="c", dtype=bool))
dask_df.compute()
Output:
a b c
0 1 2 True
1 2 3 False
The condition (here a check whether the entry in column "a" < 2) is applied on a row-by-row-basis. Note that depending on your condition and dependencies therein it might not necessarily be as straightforward, but in that case you could share additional information on what your condition entails.
You can't do that directly on a Dask DataFrame. You first need to compute it. Use this; it will work.
df = df.compute()
for i in range(len(df)):
    if (condition):
        df.loc[i, 'new_column'] = '1'
    else:
        df.loc[i, 'new_column'] = '0'
The reason behind this is that a Dask DataFrame is a lazy representation of the dataframe schema, divided into dask-delayed tasks. Hope it helps you.
I was going through these answers for a similar problem I was facing.
This worked for me.
def extractAndFill(df, datetimeColumnName):
    # Add 4 new columns for weekday, hour, month and year
    df['pickup_date_weekday'] = 0
    df['pickup_date_hour'] = 0
    df['pickup_date_month'] = 0
    df['pickup_date_year'] = 0
    # Iterate through each row and update the values for weekday, hour, month and year
    for index, row in df.iterrows():
        # Get weekday, hour, month and year
        w, h, m, y = extractDateParts(row[datetimeColumnName])
        # Update the values via .loc so they persist; assigning to the
        # row returned by iterrows() would not write back to df
        df.loc[index, 'pickup_date_weekday'] = w
        df.loc[index, 'pickup_date_hour'] = h
        df.loc[index, 'pickup_date_month'] = m
        df.loc[index, 'pickup_date_year'] = y
    return df

df1 = df1.compute()
df1 = extractAndFill(df1, 'pickup_datetime')

Equivalent of pd.Series.str.slice() and pd.Series.apply() in cuDF

I want to convert the following code (which runs in pandas) to code that runs in cuDF.
Sample data from .head() of the Series being manipulated is plugged into the original code in the 3rd code cell down -- you should be able to copy/paste and run it.
Original code in pandas
# both are float columns now
# rawcensustractandblock
s_rawcensustractandblock = df_train['rawcensustractandblock'].apply(lambda x: str(x))
# adjust/set new tract number
df_train['census_tractnumber'] = s_rawcensustractandblock.str.slice(4,11)
# adjust block number
df_train['block_number'] = s_rawcensustractandblock.str.slice(start=11)
df_train['block_number'] = df_train['block_number'].apply(lambda x: x[:4]+'.'+x[4:]+'0' )
df_train['block_number'] = df_train['block_number'].apply(lambda x: int(round(float(x),0)) )
df_train['block_number'] = df_train['block_number'].apply(lambda x: str(x).ljust(4,'0') )
Data being manipulated
# series of values from df_train['rawcensustractandblock'].head()
data = pd.Series([60371066.461001, 60590524.222024, 60374638.00300401,
60372963.002002, 60590423.381006])
Code adjusted to start with this sample data
Here's how the code looks when using the above provided data instead of the entire dataframe.
Based on errors encountered when trying to convert, the issue is at the Series level, so converting the cell below to execute in cuDF should solve the problem.
import pandas as pd
# series of values from df_train['rawcensustractandblock'].head()
data = pd.Series([60371066.461001, 60590524.222024, 60374638.00300401,
60372963.002002, 60590423.381006])
# how the first line looks using the series
s_rawcensustractandblock = data.apply(lambda x: str(x))
# adjust/set new tract number
census_tractnumber = s_rawcensustractandblock.str.slice(4,11)
# adjust block number
block_number = s_rawcensustractandblock.str.slice(start=11)
block_number = block_number.apply(lambda x: x[:4]+'.'+x[4:]+'0' )
block_number = block_number.apply(lambda x: int(round(float(x),0)) )
block_number = block_number.apply(lambda x: str(x).ljust(4,'0') )
Expected changes (output)
df_train['census_tractnumber'].head()
# out
0 1066.46
1 0524.22
2 4638.00
3 2963.00
4 0423.38
Name: census_tractnumber, dtype: object
df_train['block_number'].head()
0 1001
1 2024
2 3004
3 2002
4 1006
Name: block_number, dtype: object
You can use cuDF string methods (via nvStrings) for almost everything you're trying to do. You will lose some precision converting these floats to strings in cuDF (though it may not matter in your example above), so for this example I've simply converted beforehand. If possible, I would recommend initially creating the rawcensustractandblock as a string column rather than a float column.
import cudf
import pandas as pd

gdata = cudf.from_pandas(pd_data.astype('str'))

tractnumber = gdata.str.slice(4, 11)
blocknumber = gdata.str.slice(11)
blocknumber = blocknumber.str.slice(0, 4).str.cat(blocknumber.str.slice(4), '.')
blocknumber = blocknumber.astype('float').round(0).astype('int')
blocknumber = blocknumber.astype('str').str.ljust(4, '0')

tractnumber
0 1066.46
1 0524.22
2 4638.00
3 2963.00
4 0423.38
dtype: object
blocknumber
0 1001
1 2024
2 3004
3 2002
4 1006
dtype: object
for loop solution
pandas (original code)
import pandas as pd
# data from df_train.rawcensustractandblock.head()
pd_data = pd.Series([60371066.461001, 60590524.222024, 60374638.00300401,
60372963.002002, 60590423.381006])
# using series instead of dataframe
pd_raw_block = pd_data.apply(lambda x: str(x))
# adjust/set new tract number
pd_tractnumber = pd_raw_block.str.slice(4,11)
# set/adjust block number
pd_block_number = pd_raw_block.str.slice(11)
pd_block_number = pd_block_number.apply(lambda x: x[:4]+'.'+x[4:]+'0')
pd_block_number = pd_block_number.apply(lambda x: int(round(float(x),0)))
pd_block_number = pd_block_number.apply(lambda x: str(x).ljust(4,'0'))
# print(list(pd_tractnumber))
# print(list(pd_block_number))
cuDF (solution code)
import cudf
# data from df_train.rawcensustractandblock.head()
cudf_data = cudf.Series([60371066.461001, 60590524.222024, 60374638.00300401,
60372963.002002, 60590423.381006])
# using series instead of dataframe
cudf_tractnumber = cudf_data.values_to_string()
# adjust/set new tract number
for i in range(len(cudf_tractnumber)):
    funct = slice(4, 11)
    cudf_tractnumber[i] = cudf_tractnumber[i][funct]
# using series instead of dataframe
cudf_block_number = cudf_data.values_to_string()
# set/adjust block number
for i in range(len(cudf_block_number)):
    funct = slice(11, None)
    cudf_block_number[i] = cudf_block_number[i][funct]
    cudf_block_number[i] = cudf_block_number[i][:4] + '.' + cudf_block_number[i][4:] + '0'
    cudf_block_number[i] = int(round(float(cudf_block_number[i]), 0))
    cudf_block_number[i] = str(cudf_block_number[i]).ljust(4, '0')
# print(cudf_tractnumber)
# print(cudf_block_number)
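As a CPU sanity check (no GPU or cuDF required), the chained-string logic from the cuDF answer can be mirrored step for step in plain pandas; this sketch uses the same five sample values:

```python
import pandas as pd

pd_data = pd.Series([60371066.461001, 60590524.222024, 60374638.00300401,
                     60372963.002002, 60590423.381006])

s = pd_data.astype('str')
# tract number: characters 4-10 of the string form, e.g. '1066.46'
tractnumber = s.str.slice(4, 11)

# block number: everything from character 11 on
blocknumber = s.str.slice(11)
# re-insert a decimal point after the 4th digit, round, then left-justify
blocknumber = blocknumber.str.slice(0, 4).str.cat(blocknumber.str.slice(4), sep='.')
blocknumber = blocknumber.astype('float').round(0).astype('int')
blocknumber = blocknumber.astype('str').str.ljust(4, '0')

print(tractnumber.tolist())
print(blocknumber.tolist())
```

The outputs match the expected `census_tractnumber` and `block_number` columns shown above, which makes it easier to confirm the cuDF port behaves identically.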

Python: Printing dataframe to csv

I am currently using this code:
import pandas as pd
import numpy as np

AllDays = ['a','b','c','d']
TempDay = pd.DataFrame(np.random.randn(4,2))
TempDay['Dates'] = AllDays
TempDay.to_csv(r'H:\MyFile.csv', index=False, header=False)
But it writes the array before the dates, with a header row. I want to write the dates before the TemperatureArray and no header row.
Edit:
The file is with the TemperatureArray followed by Dates: [ TemperatureArray, Date].
-0.27724356949570034,-0.3096554106726788,a
-0.10619546908708237,0.07430127684522048,b
-0.07619665345406437,0.8474460146082116,c
0.19668718143436803,-0.8072994364484335,d
I am looking to print: [ Date TemperatureArray]
a,-0.27724356949570034,-0.3096554106726788
b,-0.10619546908708237,0.07430127684522048
c,-0.07619665345406437,0.8474460146082116
d,0.19668718143436803,-0.8072994364484335
The pandas.DataFrame.to_csv method has a keyword argument, header=True, that can be turned off to disable headers.
Using it in conjunction with index=False should solve your issue.
For example, this snippet should fix it:
TempDay.to_csv(r'H:\MyFile.csv', index=False, header=False)
Here is a full example showing how it disables the header row:
>>> import pandas as pd
>>> import numpy as np
>>> df = pd.DataFrame(np.random.randn(6,4))
>>> df
0 1 2 3
0 1.295908 1.127376 -0.211655 0.406262
1 0.152243 0.175974 -0.777358 -1.369432
2 1.727280 -0.556463 -0.220311 0.474878
3 -1.163965 1.131644 -1.084495 0.334077
4 0.769649 0.589308 0.900430 -1.378006
5 -2.663476 1.010663 -0.839597 -1.195599
>>> # just assigns sequential letters to the column
>>> df[4] = [chr(i+ord('A')) for i in range(6)]
>>> df
0 1 2 3 4
0 1.295908 1.127376 -0.211655 0.406262 A
1 0.152243 0.175974 -0.777358 -1.369432 B
2 1.727280 -0.556463 -0.220311 0.474878 C
3 -1.163965 1.131644 -1.084495 0.334077 D
4 0.769649 0.589308 0.900430 -1.378006 E
5 -2.663476 1.010663 -0.839597 -1.195599 F
>>> # here we reindex the headers and return a copy
>>> # using this form of indexing just requires you to provide
>>> # a list with all the columns you desire and in the order desired
>>> df2 = df[[4, 1, 2, 3]]
>>> df2
4 1 2 3
0 A 1.127376 -0.211655 0.406262
1 B 0.175974 -0.777358 -1.369432
2 C -0.556463 -0.220311 0.474878
3 D 1.131644 -1.084495 0.334077
4 E 0.589308 0.900430 -1.378006
5 F 1.010663 -0.839597 -1.195599
>>> df2.to_csv('a.txt', index=False, header=False)
>>> with open('a.txt') as f:
... print(f.read())
...
A,1.1273756275298716,-0.21165535441591588,0.4062624848191157
B,0.17597366083826546,-0.7773584823122313,-1.3694320591723093
C,-0.556463084618883,-0.22031139982996412,0.4748783498361957
D,1.131643603259825,-1.084494967896866,0.334077296863368
E,0.5893080536600523,0.9004299653290818,-1.3780062860066293
F,1.0106633581546611,-0.839597332636998,-1.1955992812601897
If you need to dynamically adjust the columns, and move the last column to the first, you can do as follows:
# this returns the columns as a list
columns = df.columns.tolist()
# removes the last column, the newest one you added
tofirst_column = columns.pop(-1)
# just move it to the start
new_columns = [tofirst_column] + columns
# then reindex using the new column order
df2 = df[new_columns]
This simply allows you to take the current column list, construct a Python list from the current columns, and reindex the headers without having any prior knowledge on the headers.
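Putting the two pieces together, a short end-to-end sketch (writing to a temporary file instead of a hard-coded drive letter):

```python
import os
import tempfile

import numpy as np
import pandas as pd

AllDays = ['a', 'b', 'c', 'd']
TempDay = pd.DataFrame(np.random.randn(4, 2))
TempDay['Dates'] = AllDays

# Move the last column ('Dates') to the front, then write without headers.
columns = TempDay.columns.tolist()
TempDay = TempDay[[columns[-1]] + columns[:-1]]

path = os.path.join(tempfile.mkdtemp(), 'MyFile.csv')
TempDay.to_csv(path, index=False, header=False)

with open(path) as f:
    print(f.read())
```

Each line of the file now starts with the date label followed by the two temperature values, with no header row.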
