This question may be super basic, and I apologize for that.
I am trying to write a for loop that enters a value of 1 or 0 into a pandas DataFrame based on a condition.
import pandas as pd

def checkHour6(time):
    val = 0
    if time == 6:
        val = 1
    return val

def checkHour7(time):
    val = 0
    if time == 7:
        val = 1
    return val

def checkHour8(time):
    val = 0
    if time == 8:
        val = 1
    return val

def checkHour9(time):
    val = 0
    if time == 9:
        val = 1
    return val

def checkHour10(time):
    val = 0
    if time == 10:
        val = 1
    return val
The for loop below counts from 0 to 23, and I am attempting to build a pandas DataFrame in the process that gets a 1 or 0 entered as appropriate, but I am missing something basic, because the final df is an empty DataFrame.
Create empty df:
df = pd.DataFrame({'hour_6':[], 'hour_7':[], 'hour_8':[], 'hour_9':[], 'hour_10':[]})
For loop:
hour = -1
for i in range(24):
    stuff = []
    hour = hour + 1
    stuff.append(checkHour6(hour))
    stuff.append(checkHour7(hour))
    stuff.append(checkHour8(hour))
    stuff.append(checkHour9(hour))
    stuff.append(checkHour10(hour))
    df.append(stuff)
I would suggest the following:
use only one checkHour() function with the hour passed as a parameter,
according to the pandas.DataFrame.append() documentation, the other parameter has to be a DataFrame or Series/dict-like object, or a list of these, so a plain list cannot be used,
if you want to build a data frame by appending new rows to the existing one, you have to assign the result back, because append() returns a new frame rather than modifying in place.
The code can look like this:
def checkHour(time, hour):
    val = 0
    if time == hour:
        val = 1
    return val
df = pd.DataFrame({'hour_6':[], 'hour_7':[], 'hour_8':[], 'hour_9':[], 'hour_10':[]})
hour = -1
for i in range(24):
    stuff = {}
    hour = hour + 1
    stuff['hour_6'] = checkHour(hour, 6)
    stuff['hour_7'] = checkHour(hour, 7)
    stuff['hour_8'] = checkHour(hour, 8)
    stuff['hour_9'] = checkHour(hour, 9)
    stuff['hour_10'] = checkHour(hour, 10)
    df = df.append(stuff, ignore_index=True)
The result is the following:
>>> print(df)
hour_6 hour_7 hour_8 hour_9 hour_10
0 0.0 0.0 0.0 0.0 0.0
1 0.0 0.0 0.0 0.0 0.0
2 0.0 0.0 0.0 0.0 0.0
3 0.0 0.0 0.0 0.0 0.0
4 0.0 0.0 0.0 0.0 0.0
5 0.0 0.0 0.0 0.0 0.0
6 1.0 0.0 0.0 0.0 0.0
7 0.0 1.0 0.0 0.0 0.0
8 0.0 0.0 1.0 0.0 0.0
9 0.0 0.0 0.0 1.0 0.0
10 0.0 0.0 0.0 0.0 1.0
11 0.0 0.0 0.0 0.0 0.0
12 0.0 0.0 0.0 0.0 0.0
13 0.0 0.0 0.0 0.0 0.0
14 0.0 0.0 0.0 0.0 0.0
15 0.0 0.0 0.0 0.0 0.0
16 0.0 0.0 0.0 0.0 0.0
17 0.0 0.0 0.0 0.0 0.0
18 0.0 0.0 0.0 0.0 0.0
19 0.0 0.0 0.0 0.0 0.0
20 0.0 0.0 0.0 0.0 0.0
21 0.0 0.0 0.0 0.0 0.0
22 0.0 0.0 0.0 0.0 0.0
23 0.0 0.0 0.0 0.0 0.0
EDIT:
As @Parfait mentioned, it is not good to use pandas.DataFrame.append() in a for loop, because it leads to quadratic copying. To avoid that, you can build a list of dictionaries (the future data frame rows) and then call pd.DataFrame() once to make a data frame out of it. The code looks like this:
def checkHour(time, hour):
    val = 0
    if time == hour:
        val = 1
    return val

data = []
hour = -1
for i in range(24):
    stuff = {}
    hour = hour + 1
    stuff['hour_6'] = checkHour(hour, 6)
    stuff['hour_7'] = checkHour(hour, 7)
    stuff['hour_8'] = checkHour(hour, 8)
    stuff['hour_9'] = checkHour(hour, 9)
    stuff['hour_10'] = checkHour(hour, 10)
    data.append(stuff)
df = pd.DataFrame(data)
And the result is the following:
>>> print(df)
hour_6 hour_7 hour_8 hour_9 hour_10
0 0 0 0 0 0
1 0 0 0 0 0
2 0 0 0 0 0
3 0 0 0 0 0
4 0 0 0 0 0
5 0 0 0 0 0
6 1 0 0 0 0
7 0 1 0 0 0
8 0 0 1 0 0
9 0 0 0 1 0
10 0 0 0 0 1
11 0 0 0 0 0
12 0 0 0 0 0
13 0 0 0 0 0
14 0 0 0 0 0
15 0 0 0 0 0
16 0 0 0 0 0
17 0 0 0 0 0
18 0 0 0 0 0
19 0 0 0 0 0
20 0 0 0 0 0
21 0 0 0 0 0
22 0 0 0 0 0
23 0 0 0 0 0
Another really simple way to create your data frame is to use the pandas.get_dummies() function, like this:
df = pd.DataFrame({'hour': range(24)})
df = pd.get_dummies(df.hour, prefix='hour')
df = df[['hour_6', 'hour_7', 'hour_8', 'hour_9', 'hour_10']]
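A quick, self-contained way to sanity-check the dummy approach; note that newer pandas versions return booleans from get_dummies, hence the astype(int) at the end (an assumption about your pandas version, harmless on older ones):

```python
import pandas as pd

df = pd.DataFrame({'hour': range(24)})
dummies = pd.get_dummies(df.hour, prefix='hour')
# keep only the five requested columns, as 0/1 integers
df = dummies[['hour_6', 'hour_7', 'hour_8', 'hour_9', 'hour_10']].astype(int)
```

Each hour_k column contains exactly one 1, at row k.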
At a quick glance, for the blankness issue I'd say:
hour = -1
stuff = []
for i in range(24):
    hour = hour + 1
    stuff.append(checkHour6(hour))
    stuff.append(checkHour7(hour))
    stuff.append(checkHour8(hour))
    stuff.append(checkHour9(hour))
    stuff.append(checkHour10(hour))
df.append(stuff)
It may be better to rework the whole process, though:
start off with a data column (what hour is it),
then all the other comparisons can be derived from it.
import pandas as pd

df = pd.DataFrame(range(24), columns=['data'])
for time in range(6, 11):
    df[f'hour_{time}'] = df['data'] % 24 == time
df = df.astype(int)
If you want, you can remove the data column later.
data hour_6 hour_7 hour_8 hour_9 hour_10
0 0 0 0 0 0 0
1 1 0 0 0 0 0
2 2 0 0 0 0 0
3 3 0 0 0 0 0
4 4 0 0 0 0 0
5 5 0 0 0 0 0
6 6 1 0 0 0 0
7 7 0 1 0 0 0
8 8 0 0 1 0 0
9 9 0 0 0 1 0
10 10 0 0 0 0 1
11 11 0 0 0 0 0
12 12 0 0 0 0 0
13 13 0 0 0 0 0
14 14 0 0 0 0 0
15 15 0 0 0 0 0
16 16 0 0 0 0 0
17 17 0 0 0 0 0
18 18 0 0 0 0 0
19 19 0 0 0 0 0
20 20 0 0 0 0 0
21 21 0 0 0 0 0
22 22 0 0 0 0 0
23 23 0 0 0 0 0
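Removing the helper data column mentioned above is a single drop; here is the same approach as a minimal, self-contained sketch:

```python
import pandas as pd

df = pd.DataFrame(range(24), columns=['data'])
for time in range(6, 11):
    df[f'hour_{time}'] = df['data'] % 24 == time
df = df.astype(int)

# remove the helper column once the dummy columns exist
df = df.drop(columns='data')
```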
Because the object model in numpy and pandas differs from general Python, consider avoiding building objects in a loop as you would with simpler iterables like list or dict.
In fact, your setup can be handled with a simple DataFrame.pivot on a column of 24 sequential integers, without any function or loop. You can just as easily return more hour columns (i.e., hour_0 through hour_23) or reindex down to your needed five columns:
Data
df = (pd.DataFrame({'hour': ['hour' for _ in range(24)]})
        .assign(hour=lambda x: x['hour'] + '_' + pd.Series(range(24)).astype('str'),
                num=1)
      )

df.head(5)
# hour num
# 0 hour_0 1
# 1 hour_1 1
# 2 hour_2 1
# 3 hour_3 1
# 4 hour_4 1
Pivot
pvt_df = (df.pivot(columns='hour', values='num')
.fillna(0)
.reindex(['hour_6', 'hour_7', 'hour_8', 'hour_9', 'hour_10'], axis='columns')
)
pvt_df
# hour hour_6 hour_7 hour_8 hour_9 hour_10
# 0 0.0 0.0 0.0 0.0 0.0
# 1 0.0 0.0 0.0 0.0 0.0
# 2 0.0 0.0 0.0 0.0 0.0
# 3 0.0 0.0 0.0 0.0 0.0
# 4 0.0 0.0 0.0 0.0 0.0
# 5 0.0 0.0 0.0 0.0 0.0
# 6 1.0 0.0 0.0 0.0 0.0
# 7 0.0 1.0 0.0 0.0 0.0
# 8 0.0 0.0 1.0 0.0 0.0
# 9 0.0 0.0 0.0 1.0 0.0
# 10 0.0 0.0 0.0 0.0 1.0
# 11 0.0 0.0 0.0 0.0 0.0
# 12 0.0 0.0 0.0 0.0 0.0
# 13 0.0 0.0 0.0 0.0 0.0
# 14 0.0 0.0 0.0 0.0 0.0
# 15 0.0 0.0 0.0 0.0 0.0
# 16 0.0 0.0 0.0 0.0 0.0
# 17 0.0 0.0 0.0 0.0 0.0
# 18 0.0 0.0 0.0 0.0 0.0
# 19 0.0 0.0 0.0 0.0 0.0
# 20 0.0 0.0 0.0 0.0 0.0
# 21 0.0 0.0 0.0 0.0 0.0
# 22 0.0 0.0 0.0 0.0 0.0
# 23 0.0 0.0 0.0 0.0 0.0
Related
I have a dataframe that looks like this:
Answers all_answers Score
0 0.0 0 72
1 0.0 0 73
2 0.0 0 74
3 1.0 1 1
4 -1.0 1 2
5 1.0 1 3
6 -1.0 1 4
7 1.0 1 5
8 0.0 0 1
9 0.0 0 2
10 -1.0 1 1
11 0.0 0 1
12 0.0 0 2
13 1.0 1 1
14 0.0 0 1
15 0.0 0 2
16 1.0 1 1
The first column is a signal that the sign has changed in the calculation flow.
The second is the first column with the minus sign removed.
The third is a running count for the second column: how many in a row were one, and how many were zero.
I want to add a fourth column that keeps only those ones that occurred in a run of, for example, 5 or more in a row, while preserving the sign from the first column.
To get something like this
Answers all_answers Score New
0 0.0 0 72 0
1 0.0 0 73 0
2 0.0 0 74 0
3 1.0 1 1 1
4 -1.0 1 2 -1
5 1.0 1 3 1
6 -1.0 1 4 -1
7 1.0 1 5 1
8 0.0 0 1 0
9 0.0 0 2 0
10 -1.0 1 1 0
11 0.0 0 1 0
12 0.0 0 2 0
13 1.0 1 1 0
14 0.0 0 1 0
15 0.0 0 2 0
16 1.0 1 1 0
17 0.0 0 1 0
Is it possible to do this with pandas?
You can use:
# group by consecutive 0/1
g = df['all_answers'].ne(df['all_answers'].shift()).cumsum()
# get size of each group and compare to threshold
m = df.groupby(g)['all_answers'].transform('size').ge(5)
# mask small groups
df['New'] = df['Answers'].where(m, 0)
Output:
Answers all_answers Score New
0 0.0 0 72 0.0
1 0.0 0 73 0.0
2 0.0 0 74 0.0
3 1.0 1 1 1.0
4 -1.0 1 2 -1.0
5 1.0 1 3 1.0
6 -1.0 1 4 -1.0
7 1.0 1 5 1.0
8 0.0 0 1 0.0
9 0.0 0 2 0.0
10 -1.0 1 1 0.0
11 0.0 0 1 0.0
12 0.0 0 2 0.0
13 1.0 1 1 0.0
14 0.0 0 1 0.0
15 0.0 0 2 0.0
16 1.0 1 1 0.0
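To see what each of the three steps produces, here is the same logic on a small hand-made frame (the values are made up purely for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    'Answers':     [0.0, 1.0, -1.0, 1.0, -1.0, 1.0, 0.0, -1.0],
    'all_answers': [0,   1,   1,    1,   1,    1,   0,   1],
})

# consecutive runs of equal values get the same group id
g = df['all_answers'].ne(df['all_answers'].shift()).cumsum()
# keep only runs of length >= 5
m = df.groupby(g)['all_answers'].transform('size').ge(5)
# zero out everything outside the long runs
df['New'] = df['Answers'].where(m, 0)
```

The run of five ones keeps its signed values; the lone trailing 1 is masked to 0.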
A faster way (with regex):
import pandas as pd
import re
def repl5(m):
    return '5' * len(m.group())
s = df['all_answers'].astype(str).str.cat()
d = re.sub('(?:1{5,})', repl5, s)
d = [x=='5' for x in list(d)]
df['New'] = df['Answers'].where(d, 0.0)
df
Output:
Answers all_answers Score New
0 0.0 0 72 0.0
1 0.0 0 73 0.0
2 0.0 0 74 0.0
3 1.0 1 1 1.0
4 -1.0 1 2 -1.0
5 1.0 1 3 1.0
6 -1.0 1 4 -1.0
7 1.0 1 5 1.0
8 0.0 0 1 0.0
9 0.0 0 2 0.0
10 -1.0 1 1 0.0
11 0.0 0 1 0.0
12 0.0 0 2 0.0
13 1.0 1 1 0.0
14 0.0 0 1 0.0
15 0.0 0 2 0.0
16 1.0 1 1 0.0
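The string trick above is easier to see on a bare example (made-up digits, not the question's data): runs of five or more consecutive 1s get rewritten as 5s, and everything else is left alone.

```python
import re

s = '001111100101111110'
# rewrite runs of five or more 1s as 5s
d = re.sub('1{5,}', lambda m: '5' * len(m.group()), s)
# positions marked '5' are the ones to keep
mask = [c == '5' for c in d]
```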
I want to summarize rows and columns of two dataframes (pdf and wdf) and save the results in the columns of another dataframe (to_hex).
I tried it for one dataframe and it worked. It doesn't work for the other (it gives NaN). I cannot understand what the difference is.
to_hex = pd.DataFrame(0, index=np.arange(len(sasiedztwo)), columns=['ID','podroze','p_rozmyte'])
to_hex.loc[:,'ID']= wdf.index+1
to_hex.index=pdf.index
to_hex.loc[:,'podroze']= pd.DataFrame(pdf.sum(axis=0))[:]
to_hex.index=wdf.index
to_hex.loc[:,'p_rozmyte']= pd.DataFrame(wdf.sum(axis=0))[:]
This is how pdf dataframe looks like:
0 1 2 3 4 5 6 7 8
0 0 0 10 0 0 0 0 0 100
1 0 0 0 0 0 0 0 0 0
2 0 0 0 0 0 0 0 0 0
3 0 0 0 0 0 0 0 0 0
4 0 0 0 0 0 0 0 0 0
5 0 0 0 0 0 0 0 0 0
6 0 0 0 0 0 0 0 0 0
7 0 0 0 0 0 0 0 0 1000
8 0 0 0 0 0 0 0 0 0
This is wdf:
0 1 2 3 4 5 6 7 8
0 2.5 5.0 35.0 0.0 27.5 55.0 25.0 50.0 102.5
1 0.0 0.0 5.0 0.0 0.0 0.0 0.0 0.0 300.0
2 0.0 0.0 2.5 0.0 0.0 0.0 0.0 0.0 25.0
3 0.0 0.0 5.0 0.0 0.0 0.0 0.0 0.0 300.0
4 0.0 0.0 2.5 0.0 0.0 0.0 0.0 0.0 525.0
5 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 250.0
6 0.0 0.0 2.5 0.0 0.0 0.0 0.0 0.0 525.0
7 0.0 0.0 250.0 0.0 250.0 500.0 250.0 500.0 1000.0
8 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 500.0
And this is the result in to_hex:
ID podroze p_rozmyte
0 1 0 NaN
1 2 0 NaN
2 3 10 NaN
3 4 0 NaN
4 5 0 NaN
5 6 0 NaN
6 7 0 NaN
7 8 0 NaN
8 9 1100 NaN
SOLUTION:
One option to solve it is to modify your code as follows:
to_hex.loc[:,'ID']= wdf.index+1
# to_hex.index=pdf.index # no need
to_hex.loc[:,'podroze']= pdf.sum(axis=0) # modified; directly use the series output from SUM()
# to_hex.index=wdf.index # no need
to_hex.loc[:,'p_rozmyte']= wdf.sum(axis=0) # modified
Then you get:
ID podroze p_rozmyte
0 1 0 2.5
1 2 0 5.0
2 3 10 302.5
3 4 0 0.0
4 5 0 277.5
5 6 0 555.0
6 7 0 275.0
7 8 0 550.0
8 9 1100 3527.5
I think the reason you get NaN in one case and correct values in the other lies in to_hex.dtypes:
ID int64
podroze int64
p_rozmyte int64
dtype: object
And as you see, the to_hex dataframe has int64 column types. This is fine when you add the pdf sums (since they have the same dtype):
pd.DataFrame(pdf.sum(axis=0))[:].dtypes
0 int64
dtype: object
but it does not work when you add the wdf sums:
pd.DataFrame(wdf.sum(axis=0))[:].dtypes
0 float64
dtype: object
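On a small toy pair of frames, the fixed assignments behave as expected; the numbers here are made up, only the dtypes matter:

```python
import numpy as np
import pandas as pd

pdf = pd.DataFrame([[0, 10], [0, 100]])           # int columns
wdf = pd.DataFrame([[2.5, 5.0], [0.0, 300.0]])    # float columns

to_hex = pd.DataFrame(0, index=np.arange(len(pdf)),
                      columns=['ID', 'podroze', 'p_rozmyte'])
to_hex['ID'] = wdf.index + 1
# assigning the Series directly aligns on the index,
# and lets pandas upcast the int column to float where needed
to_hex['podroze'] = pdf.sum(axis=0)
to_hex['p_rozmyte'] = wdf.sum(axis=0)
```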
My file is formatted like this:
2106 2002 27 26 1
1 0.000000 0.000000
2 0.389610 0.000000
3 0.779221 0.000000
4 1.168831 0.000000
5 1.558442 0.000000
6 1.948052 0.000000
7 2.337662 0.000000
8 2.727273 0.000000
9 3.116883 0.000000
10 3.506494 0.000000
I want to read these in. There are more rows than shown, and some have only two columns. In MATLAB I use readmatrix() and it works well; does Python have anything comparable? genfromtxt() and loadtxt() do not work with a variable number of columns.
Should I just stick with MATLAB since Python seems to be missing key functionality like this?
Edit: Here is the output that I get in matlab that I would like in numpy:
2106 2002 27 26 1 0
1 0 0 0 0 0
2 0.389610000000000 0 0 0 0
3 0.779221000000000 0 0 0 0
4 1.16883100000000 0 0 0 0
5 1.55844200000000 0 0 0 0
6 1.94805200000000 0 0 0 0
7 2.33766200000000 0 0 0 0
8 2.72727300000000 0 0 0 0
9 3.11688300000000 0 0 0 0
10 3.50649400000000 0 0 0 0
import numpy as np

headers = []
rows = []
with open("test.txt", 'r') as file:
    for i, v in enumerate(file.readlines()):
        if i == 0:
            headers.extend(v.split())
        else:
            rows.append(v.split())

# pad short rows so every row has as many fields as the header row
for i, v in enumerate(rows):
    while len(v) != len(headers):
        v.append('0')  # append a string so the final array is numeric, not mixed
    rows[i] = v

rows = np.array(rows, dtype=float)
Let me know if any modifications are needed.
You have missing values in your columns that MATLAB interprets as 0. You can import a similar structure into pandas, and pandas will infer the right number of columns. It interprets missing values as NaN, which you can later replace with 0 if you prefer. The only catch is that the first row has to have the right number of columns: if it should end with a 0, write an explicit 0 instead of leaving a blank:
df = pd.read_csv('file.csv', sep=r'\s+').fillna(0)
output:
2106 2002 27 26 1 0
0 1 0.000000 0.0 0.0 0.0 0.0
1 2 0.389610 0.0 0.0 0.0 0.0
2 3 0.779221 0.0 0.0 0.0 0.0
3 4 1.168831 0.0 0.0 0.0 0.0
4 5 1.558442 0.0 0.0 0.0 0.0
5 6 1.948052 0.0 0.0 0.0 0.0
6 7 2.337662 0.0 0.0 0.0 0.0
7 8 2.727273 0.0 0.0 0.0 0.0
8 9 3.116883 0.0 0.0 0.0 0.0
9 10 3.506494 0.0 0.0 0.0 0.0
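If the end goal is a plain numeric array like MATLAB's readmatrix returns, the same idea with header=None plus to_numpy() gets there. Here is a sketch on an inline sample, where io.StringIO stands in for the real file:

```python
import io
import pandas as pd

text = """2106 2002 27 26 1 0
1 0.000000 0.000000
2 0.389610 0.000000
"""
# header=None keeps the first line as data; short rows are padded with NaN,
# which fillna(0) turns into zeros, matching MATLAB's behavior
df = pd.read_csv(io.StringIO(text), sep=r'\s+', header=None).fillna(0)
arr = df.to_numpy()
```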
It's a bit complicated to explain, so I'll do my best. I have a pandas DataFrame with two columns: hour (from 1 to 24) and value (corresponding to each hour). The dataset index is huge, but the hour column repeats on a 24-hour basis (from 1 to 24). I am trying to create 24 new columns: value -1, value -2, value -3, ..., value -24, where each row gets the value from 1 hour earlier, 2 hours earlier, and so on (from the rows above).
hour | value | value -1 | value -2 | value -3| ... | value - 24
1 10 0 0 0 0
2 11 10 0 0 0
3 12 11 10 0 0
4 13 12 11 10 0
...
24 32 31 30 29 0
1 33 32 31 30 10
2 34 33 32 31 11
and so on...
All the value numbers are just for the example. As I said, there are lots of rows: not only the 24 hours of a single day, but the whole following time series from 1 to 24, repeating.
Thanks in advance and may the force be with you!
Is this what you need?
df = pd.DataFrame([[1, 10], [2, 11],
                   [3, 12], [4, 13]], columns=['hour', 'value'])
for i in range(1, 24):
    df['value -' + str(i)] = df['value'].shift(i).fillna(0)
Is this what you are looking for?
import pandas as pd

df = pd.DataFrame({'hour': list(range(24)) * 2,
                   'value': list(range(48))})

shift_cols_n = 10
for shift in range(1, shift_cols_n):
    new_columns_name = 'value - ' + str(shift)

    # Assuming that you don't have any NAs in your dataframe
    df[new_columns_name] = df['value'].shift(shift).fillna(0)

    # A safer (and less simple) way, in case you do have NAs in your dataframe
    df[new_columns_name] = df['value'].shift(shift)
    df.loc[:shift - 1, new_columns_name] = 0   # .loc slices are label-inclusive

print(df.head(9))
hour value value - 1 value - 2 value - 3 value - 4 value - 5 \
0 0 0 0.0 0.0 0.0 0.0 0.0
1 1 1 0.0 0.0 0.0 0.0 0.0
2 2 2 1.0 0.0 0.0 0.0 0.0
3 3 3 2.0 1.0 0.0 0.0 0.0
4 4 4 3.0 2.0 1.0 0.0 0.0
5 5 5 4.0 3.0 2.0 1.0 0.0
6 6 6 5.0 4.0 3.0 2.0 1.0
7 7 7 6.0 5.0 4.0 3.0 2.0
8 8 8 7.0 6.0 5.0 4.0 3.0
value - 6 value - 7 value - 8 value - 9
0 0.0 0.0 0.0 0.0
1 0.0 0.0 0.0 0.0
2 0.0 0.0 0.0 0.0
3 0.0 0.0 0.0 0.0
4 0.0 0.0 0.0 0.0
5 0.0 0.0 0.0 0.0
6 0.0 0.0 0.0 0.0
7 1.0 0.0 0.0 0.0
8 2.0 1.0 0.0 0.0
I am trying to import the Semeion Handwritten Digit Data Set as a pandas DataFrame, but the first row is being taken as the column names.
df.head()
0.0000 0.0000.1 0.0000.2 0.0000.3 0.0000.4 0.0000.5 1.0000 1.0000.1 \
0 0.0 0.0 0.0 0.0 0.0 1.0 1.0 1.0
1 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
2 0.0 0.0 0.0 0.0 0.0 0.0 1.0 1.0
3 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
4 0.0 0.0 0.0 0.0 1.0 1.0 1.0 1.0
1.0000.2 1.0000.3 ... 1 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8
0 1.0 1.0 ... 1 0 0 0 0 0 0 0 0 0
1 0.0 1.0 ... 1 0 0 0 0 0 0 0 0 0
2 1.0 1.0 ... 1 0 0 0 0 0 0 0 0 0
3 0.0 1.0 ... 1 0 0 0 0 0 0 0 0 0
4 1.0 1.0 ... 1 0 0 0 0 0 0 0 0 0
[5 rows x 266 columns]
Since the DataFrame has 266 columns, I am trying to assign numbers as column names using a lambda and a for loop, with the following code:
df = pd.read_csv("http://archive.ics.uci.edu/ml/machine-learning-databases/semeion/semeion.data", delimiter = r"\s+",
names = (lambda x: x for x in range(0,266)) )
But am getting weird column names, like:
>>> df.head(2)
<function <genexpr>.<lambda> at 0x04F4E588> \
0 0.0
1 0.0
<function <genexpr>.<lambda> at 0x04F4E618> \
0 0.0
1 0.0
<function <genexpr>.<lambda> at 0x04F4E660> \
0 0.0
1 0.0
If I remove the parentheses, then the code throws a syntax error:
>>> df = pd.read_csv("http://archive.ics.uci.edu/ml/machine-learning-databases/semeion/semeion.data", delimiter = r"\s+",
names = lambda x: x for x in range(0,266) )
SyntaxError: invalid syntax
Can someone tell me:
1) How to get column names as numbers, from 0 to 265?
2) If I get a DataFrame with the first row as column names, how do I push it down and add new column names, without losing the first row?
TIA
I think you need the parameter header=None, or names=range(266) to set default column names, in read_csv:
url = "http://archive.ics.uci.edu/ml/machine-learning-databases/semeion/semeion.data"
df = pd.read_csv(url, sep = r"\s+", header=None)
df = pd.read_csv(url, sep = r"\s+", names=range(266))
Also, if you want the list of names explicitly, note that it should be:
my_columns = list(range(266))
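For question 2), in case the frame was already read with its first data row consumed as the header, one way to push that row back down is sketched below, using a made-up three-column frame:

```python
import pandas as pd

# suppose the first data row [1, 2, 3] was mistakenly used as the header
df = pd.DataFrame([[4, 5, 6]], columns=[1, 2, 3])

# re-insert the header labels as a regular first row, then renumber the columns
header_row = pd.DataFrame([df.columns.tolist()], columns=df.columns)
df = pd.concat([header_row, df], ignore_index=True)
df.columns = range(df.shape[1])
```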