I am trying to import Semeion Handwritten Digit Data Set as a pandas DataFrame, but the first row is being taken as column names.
df.head()
0.0000 0.0000.1 0.0000.2 0.0000.3 0.0000.4 0.0000.5 1.0000 1.0000.1 \
0 0.0 0.0 0.0 0.0 0.0 1.0 1.0 1.0
1 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
2 0.0 0.0 0.0 0.0 0.0 0.0 1.0 1.0
3 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
4 0.0 0.0 0.0 0.0 1.0 1.0 1.0 1.0
1.0000.2 1.0000.3 ... 1 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8
0 1.0 1.0 ... 1 0 0 0 0 0 0 0 0 0
1 0.0 1.0 ... 1 0 0 0 0 0 0 0 0 0
2 1.0 1.0 ... 1 0 0 0 0 0 0 0 0 0
3 0.0 1.0 ... 1 0 0 0 0 0 0 0 0 0
4 1.0 1.0 ... 1 0 0 0 0 0 0 0 0 0
[5 rows x 266 columns]
Since the DataFrame has 266 columns, I am trying to assign numbers as column names, using a lambda and a for loop, with the following code:
df = pd.read_csv("http://archive.ics.uci.edu/ml/machine-learning-databases/semeion/semeion.data", delimiter = r"\s+",
names = (lambda x: x for x in range(0,266)) )
But I am getting weird column names, like:
>>> df.head(2)
<function <genexpr>.<lambda> at 0x04F4E588> \
0 0.0
1 0.0
<function <genexpr>.<lambda> at 0x04F4E618> \
0 0.0
1 0.0
<function <genexpr>.<lambda> at 0x04F4E660> \
0 0.0
1 0.0
If I remove the parentheses, then the code throws a syntax error:
>>> df = pd.read_csv("http://archive.ics.uci.edu/ml/machine-learning-databases/semeion/semeion.data", delimiter = r"\s+",
names = lambda x: x for x in range(0,266) )
SyntaxError: invalid syntax
Can someone tell me:
1) How do I get column names as numbers, from 0 to 265?
2) If I get a DataFrame with the first row taken as column names, how do I push it down and add new column names without losing the first row?
TIA
I think you need the parameter header=None, or names=range(266), to set default column names in read_csv:
url = "http://archive.ics.uci.edu/ml/machine-learning-databases/semeion/semeion.data"
df = pd.read_csv(url, sep = r"\s+", header=None)
df = pd.read_csv(url, sep = r"\s+", names=range(266))
You can also build the names list explicitly and assign it:
my_columns = list(range(266))
df.columns = my_columns
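For the second part of the question (a DataFrame that has already swallowed its first data row as the header), re-reading with header=None is the cleanest fix. If re-reading is not possible, a sketch like the following pushes the header back down; note that pandas mangles duplicate header values (the .1, .2 suffixes above), so the recovered row may need cleanup:
import pandas as pd

# turn the current column labels back into a data row (values come back as strings)
header_as_row = pd.DataFrame([list(df.columns)], columns=df.columns)
df = pd.concat([header_as_row, df], ignore_index=True)
df.columns = range(df.shape[1])  # numeric column names 0..265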
I have to create a timeseries using column values for computing the Recency of a customer.
The formula I have to use is R(t) = 0 if the customer has bought something in that month, R(t-1) + 1 otherwise.
I managed to compute a dataframe
CustomerID -1 0 1 2 3 4 5 6 7 8 9 10 11 12
0 17850 0 0.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
1 13047 0 0.0 1.0 0.0 0.0 1.0 0.0 0.0 1.0 0.0 1.0 0.0 1.0 1.0
2 12583 0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
3 14688 0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0
4 15311 0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
3750 15471 0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 0.0
3751 13436 0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 0.0
3752 15520 0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 0.0
3753 14569 0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 0.0
3754 12713 0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 0.0
In which there's a 0 if the customer has bought something in that month and a 1 otherwise. The column names indicate a time period, with the column "-1" as a dummy column.
How can I replace the value in each column with 0 if the current value is 0 and with the value of the previous column + 1 otherwise?
For example, the final result for the second customer should be 0 1 0 0 1 0 0 1 0 1 0 1 2
I know how to apply a function to a column, but I don't know how to make that function use the value from the previous column.
Just use the apply function to iterate over the rows of the DataFrame and do the manipulation. Note that the running count has to reference the values already computed, not the original row, so build the result up in a list:
def apply_function(row):
    out = []
    for i, item in enumerate(row):
        # R(t) = 0 on a purchase month, else R(t-1) + 1; position 0 is kept as-is
        out.append(item if i == 0 else 0 if item == 0 else out[-1] + 1)
    return out
new_df = df.apply(apply_function, axis=1, result_type='expand')
new_df.columns = df.columns  # just to set the previous column names
Do you insist on using the column structure? It is common with time series to use rows, e.g., a dataframe with columns CustomerID, hasBoughtThisMonth. You can then easily add the Recency column by using a pandas transform().
I cannot place comments yet, hence the question in this way.
Edit: here is another way to go. I took two customers as an example, and some random numbers for whether or not they bought something in a month.
Basically, you reshape your table to long format and use a groupby plus cumsum. Notice that I avoid your dummy column in this way.
import pandas as pd
import numpy as np
np.random.seed(1)
# Make example dataframe
df = pd.DataFrame({'CustomerID': [1]*12+[2]*12,
'Month': [1,2,3,4,5,6,7,8,9,10,11,12]*2,
'hasBoughtThisMonth': np.random.randint(2,size=24)})
# Make Recency column by finding contiguous groups of ones, and groupby
contiguous_groups = df['hasBoughtThisMonth'].diff().ne(0).cumsum()
df['Recency']=df.groupby(by=['CustomerID', contiguous_groups],
as_index=False)['hasBoughtThisMonth'].cumsum().reset_index(drop=True)
The result is
CustomerID Month hasBoughtThisMonth Recency
0 1 1 1 1
1 1 2 1 2
2 1 3 0 0
3 1 4 0 0
4 1 5 1 1
5 1 6 1 2
6 1 7 1 3
7 1 8 1 4
8 1 9 1 5
9 1 10 0 0
10 1 11 0 0
11 1 12 1 1
12 2 1 0 0
13 2 2 1 1
14 2 3 1 2
15 2 4 0 0
16 2 5 0 0
17 2 6 1 1
18 2 7 0 0
19 2 8 0 0
20 2 9 0 0
21 2 10 1 1
22 2 11 0 0
23 2 12 0 0
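If you still need the original wide layout afterwards, you can pivot the result back, reusing the column names from the example above:
# one row per CustomerID, one column per Month, Recency as the values
wide = df.pivot(index='CustomerID', columns='Month', values='Recency')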
It would be easier if you first set CustomerID as the index and transpose your DataFrame, then apply your custom function, i.e. something like:
df.set_index('CustomerID').T.apply(custom_func)
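A minimal sketch of that idea; custom_func here is an assumed implementation of the recency recurrence, applied down each customer's column:
def custom_func(col):
    out = []
    for v in col:
        # R(t) = 0 on a purchase month, R(t-1) + 1 otherwise
        out.append(0 if v == 0 else (out[-1] + 1 if out else 1))
    return out

result = df.set_index('CustomerID').T.apply(custom_func).T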
I would like to see how many times a url is labelled with 1 and how many times it is labelled with 0.
My dataset is
Label URL
0 0.0 www.nytimes.com
1 0.0 newatlas.com
2 1.0 www.facebook.com
3 1.0 www.facebook.com
4 0.0 issuu.com
... ... ...
3572 0.0 www.businessinsider.com
3573 0.0 finance.yahoo.com
3574 0.0 www.cnbc.com
3575 0.0 www.ndtv.com
3576 0.0 www.baystatehealth.org
I tried df.groupby("URL")["Label"].count(), but it does not return the expected output, which would be:
Label URL Freq
0 0.0 www.nytimes.com 1
0 1.0 www.nytimes.com 0
1 0.0 newatlas.com 1
1 1.0 newatlas.com 0
2 1.0 www.facebook.com 2
2 0.0 www.facebook.com 0
4 0.0 issuu.com 1
4 1.0 issuu.com 0
... ... ...
What fields should I consider in the group by to get something like the above df (expected output)?
You need counts for the unique combinations of URL and Label:
df.groupby(["URL","Label"]).size()
Alternatively, you can do value_counts:
df.value_counts(["URL","Label"])
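If you also want explicit zero rows for the label a URL never received, as in the expected output above, one sketch is to count and then unstack with a fill value:
counts = (df.groupby(["URL", "Label"]).size()
            .unstack(fill_value=0)   # one column per label, missing combinations become 0
            .stack()                 # back to long form, zeros now included
            .reset_index(name="Freq"))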
Use agg:
df.groupby("URL").agg({'Label': lambda x: x.nunique()})
(note this gives the number of distinct labels per URL, not the per-label counts).
I have a large dataframe of the form:
user_id time_interval A B C D E F G H ... Z
0 12166 2.0 3.0 1.0 1.0 1.0 3.0 1.0 1.0 1.0 ... 0.0
1 12167 0.0 0.0 1.0 0.0 0.0 1.0 0.0 0.0 1.0 ... 0.0
2 12168 0.0 0.0 1.0 0.0 0.0 1.0 0.0 0.0 1.0 ... 0.0
3 12169 0.0 0.0 1.0 0.0 0.0 1.0 0.0 0.0 1.0 ... 0.0
4 12170 0.0 0.0 1.0 0.0 0.0 1.0 0.0 0.0 1.0 ... 0.0
... ... ... ... ... ... ... ... ... ... ... ... ...
I would like to find, for each user_id, based on the columns A-Z as coordinates, the closest neighbors within a 'radius' distance r. For example, for r=0.1 the output should look like:
user_id neighbors
12166 [12251,12345, ...]
12167 [12168, 12169,12170, ...]
... ...
I tried for-looping through the user_id list, but it takes ages.
I did something like this:
import scipy.spatial.distance

neighbors = []
for i in range(len(dataframe)):
    user_neighbors = [dataframe["user_id"][j] for j in range(i+1, len(dataframe)) if scipy.spatial.distance.euclidean(dataframe.values[i][2:], dataframe.values[j][2:]) < 0.1]
    neighbors.append([dataframe["user_id"][i], user_neighbors])
and I have been waiting for hours.
Is there a pythonic way to improve this?
Here's how I've done it using the apply method.
The dummy data consists of columns A-D, with an added column for the neighbors:
print(df)
user_id time_interval A B C D neighbors
0 12166 2 3 2 2 3 NaN
1 12167 0 1 4 3 3 NaN
2 12168 0 4 3 3 1 NaN
3 12169 0 2 2 3 2 NaN
4 12170 0 3 3 1 1 NaN
the custom function:
def func(row):
    r = 2.5  # the threshold
    out = df[(((df.iloc[:, 2:-1] - row[2:-1])**2).sum(axis=1)**0.5).le(r)]['user_id'].to_list()
    out.remove(row['user_id'])
    df.loc[row.name, ['neighbors']] = str(out)

df.apply(func, axis=1)
the output:
print(df):
user_id time_interval A B C D neighbors
0 12166 2 3 2 2 3 [12169, 12170]
1 12167 0 1 4 3 3 [12169]
2 12168 0 4 3 3 1 [12169, 12170]
3 12169 0 2 2 3 2 [12166, 12167, 12168]
4 12170 0 3 3 1 1 [12166, 12168]
Let me know if it outperforms the for-loop approach.
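Separately, for a fixed-radius neighbor search like this, scipy's cKDTree is usually far faster than any row-by-row approach. A sketch, assuming the coordinate columns start at position 2 as in the question's loop:
from scipy.spatial import cKDTree

coords = dataframe.iloc[:, 2:].to_numpy()         # the A-Z coordinate columns
tree = cKDTree(coords)
idx_lists = tree.query_ball_point(coords, r=0.1)  # neighbor indices within radius r

user_ids = dataframe['user_id'].to_numpy()
dataframe['neighbors'] = [
    [user_ids[j] for j in idx if j != i]          # exclude the point itself
    for i, idx in enumerate(idx_lists)
]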
My file is formatted like this:
2106 2002 27 26 1
1 0.000000 0.000000
2 0.389610 0.000000
3 0.779221 0.000000
4 1.168831 0.000000
5 1.558442 0.000000
6 1.948052 0.000000
7 2.337662 0.000000
8 2.727273 0.000000
9 3.116883 0.000000
10 3.506494 0.000000
I want to read these in. There are more rows than this, and some have only two columns. In MATLAB I use readmatrix() and it works well; does Python have anything comparable? numpy's genfromtxt() and loadtxt() do not work with a variable number of columns.
Should I just stick with MATLAB since Python seems to be missing key functionality like this?
Edit: Here is the output that I get in matlab that I would like in numpy:
2106 2002 27 26 1 0
1 0 0 0 0 0
2 0.389610000000000 0 0 0 0
3 0.779221000000000 0 0 0 0
4 1.16883100000000 0 0 0 0
5 1.55844200000000 0 0 0 0
6 1.94805200000000 0 0 0 0
7 2.33766200000000 0 0 0 0
8 2.72727300000000 0 0 0 0
9 3.11688300000000 0 0 0 0
10 3.50649400000000 0 0 0 0
import numpy as np

headers = []
rows = []
with open("test.txt", 'r') as file:
    for i, v in enumerate(file.readlines()):
        if i == 0:
            headers.extend(v.split())
        else:
            rows.append(v.split())

# pad short rows with zeros so every row is as wide as the first line
for i, v in enumerate(rows):
    while len(v) != len(headers):
        v.append('0')
    rows[i] = v

rows = np.array(rows, dtype=float)
Let me know if any modifications are needed.
You have missing values in your columns, which MATLAB interprets as 0. You can import a similar structure into pandas, and pandas will produce the right number of columns. It interprets missing values as NaN, which you can later replace with 0 if you prefer it that way. The only catch is to have the right number of columns in the first row: if it should end with a 0, write an explicit 0 instead of leaving a blank:
import pandas as pd

df = pd.read_csv('file.csv', sep=r'\s+').fillna(0)
output:
2106 2002 27 26 1 0
0 1 0.000000 0.0 0.0 0.0 0.0
1 2 0.389610 0.0 0.0 0.0 0.0
2 3 0.779221 0.0 0.0 0.0 0.0
3 4 1.168831 0.0 0.0 0.0 0.0
4 5 1.558442 0.0 0.0 0.0 0.0
5 6 1.948052 0.0 0.0 0.0 0.0
6 7 2.337662 0.0 0.0 0.0 0.0
7 8 2.727273 0.0 0.0 0.0 0.0
8 9 3.116883 0.0 0.0 0.0 0.0
9 10 3.506494 0.0 0.0 0.0 0.0
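If you want the first row kept as data as well, matching the MATLAB output, a sketch is to pass header=None with explicit column names so pandas pads the short rows with NaN (assuming 6 is the widest row, as above):
import pandas as pd

arr = (pd.read_csv('file.txt', sep=r'\s+', header=None, names=range(6))
         .fillna(0)
         .to_numpy())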
This question may be super basic, and I apologize for that, but I am trying to create a for loop that enters a value of 1 or 0 into a pandas DataFrame based on a condition.
import pandas as pd
def checkHour6(time):
    val = 0
    if time == 6:
        val = 1
    return val

def checkHour7(time):
    val = 0
    if time == 7:
        val = 1
    return val

def checkHour8(time):
    val = 0
    if time == 8:
        val = 1
    return val

def checkHour9(time):
    val = 0
    if time == 9:
        val = 1
    return val

def checkHour10(time):
    val = 0
    if time == 10:
        val = 1
    return val
The for loop below counts from 0 to 23, and I am attempting to build the pandas DataFrame in the process, entering a 1 or 0 appropriately, but I am missing something basic: the final df is an empty DataFrame.
Create empty df:
df = pd.DataFrame({'hour_6':[], 'hour_7':[], 'hour_8':[], 'hour_9':[], 'hour_10':[]})
For Loop:
hour = -1
for i in range(24):
    stuff = []
    hour = hour + 1
    stuff.append(checkHour6(hour))
    stuff.append(checkHour7(hour))
    stuff.append(checkHour8(hour))
    stuff.append(checkHour9(hour))
    stuff.append(checkHour10(hour))
    df.append(stuff)
I would suggest the following:
use only one checkHour() function with a parameter for the hour,
according to the pandas.DataFrame.append() documentation, the other parameter has to be a DataFrame, a Series/dict-like object, or a list of these, so a plain list of values cannot be used,
if you want to build a data frame by appending new rows to an existing one, you have to assign the result back, since append does not work in place.
The code can look like this:
def checkHour(time, hour):
    val = 0
    if time == hour:
        val = 1
    return val

df = pd.DataFrame({'hour_6':[], 'hour_7':[], 'hour_8':[], 'hour_9':[], 'hour_10':[]})

hour = -1
for i in range(24):
    stuff = {}
    hour = hour + 1
    stuff['hour_6'] = checkHour(hour, 6)
    stuff['hour_7'] = checkHour(hour, 7)
    stuff['hour_8'] = checkHour(hour, 8)
    stuff['hour_9'] = checkHour(hour, 9)
    stuff['hour_10'] = checkHour(hour, 10)
    df = df.append(stuff, ignore_index=True)
The result is the following:
>>> print(df)
hour_6 hour_7 hour_8 hour_9 hour_10
0 0.0 0.0 0.0 0.0 0.0
1 0.0 0.0 0.0 0.0 0.0
2 0.0 0.0 0.0 0.0 0.0
3 0.0 0.0 0.0 0.0 0.0
4 0.0 0.0 0.0 0.0 0.0
5 0.0 0.0 0.0 0.0 0.0
6 1.0 0.0 0.0 0.0 0.0
7 0.0 1.0 0.0 0.0 0.0
8 0.0 0.0 1.0 0.0 0.0
9 0.0 0.0 0.0 1.0 0.0
10 0.0 0.0 0.0 0.0 1.0
11 0.0 0.0 0.0 0.0 0.0
12 0.0 0.0 0.0 0.0 0.0
13 0.0 0.0 0.0 0.0 0.0
14 0.0 0.0 0.0 0.0 0.0
15 0.0 0.0 0.0 0.0 0.0
16 0.0 0.0 0.0 0.0 0.0
17 0.0 0.0 0.0 0.0 0.0
18 0.0 0.0 0.0 0.0 0.0
19 0.0 0.0 0.0 0.0 0.0
20 0.0 0.0 0.0 0.0 0.0
21 0.0 0.0 0.0 0.0 0.0
22 0.0 0.0 0.0 0.0 0.0
23 0.0 0.0 0.0 0.0 0.0
EDIT:
As #Parfait mentioned, it is not good to use pandas.DataFrame.append() in a for loop, because it leads to quadratic copying (and in recent pandas versions DataFrame.append has been removed entirely). To avoid that, you can make a list of dictionaries (the future data frame rows) and afterwards call pd.DataFrame() to make a data frame out of it. The code looks like this:
def checkHour(time, hour):
    val = 0
    if time == hour:
        val = 1
    return val

data = []
hour = -1
for i in range(24):
    stuff = {}
    hour = hour + 1
    stuff['hour_6'] = checkHour(hour, 6)
    stuff['hour_7'] = checkHour(hour, 7)
    stuff['hour_8'] = checkHour(hour, 8)
    stuff['hour_9'] = checkHour(hour, 9)
    stuff['hour_10'] = checkHour(hour, 10)
    data.append(stuff)

df = pd.DataFrame(data)
And the result is the following:
>>> print(df)
hour_6 hour_7 hour_8 hour_9 hour_10
0 0 0 0 0 0
1 0 0 0 0 0
2 0 0 0 0 0
3 0 0 0 0 0
4 0 0 0 0 0
5 0 0 0 0 0
6 1 0 0 0 0
7 0 1 0 0 0
8 0 0 1 0 0
9 0 0 0 1 0
10 0 0 0 0 1
11 0 0 0 0 0
12 0 0 0 0 0
13 0 0 0 0 0
14 0 0 0 0 0
15 0 0 0 0 0
16 0 0 0 0 0
17 0 0 0 0 0
18 0 0 0 0 0
19 0 0 0 0 0
20 0 0 0 0 0
21 0 0 0 0 0
22 0 0 0 0 0
23 0 0 0 0 0
Another really simple way to create your data frame is to use the pandas.get_dummies() function, like this:
df = pd.DataFrame({'hour': range(24)})
df = pd.get_dummies(df.hour, prefix='hour')
df = df[['hour_6', 'hour_7', 'hour_8', 'hour_9', 'hour_10']]
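One caveat: newer pandas versions return boolean columns from get_dummies, so if you need the 0/1 integers shown elsewhere in this thread, cast at the end:
df = df.astype(int)  # convert boolean dummies to 0/1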
At a quick glance, for the blankness issue: df.append(stuff) returns a new DataFrame instead of modifying df in place, and append expects dict/Series-like rows rather than a flat list. Collecting the rows and building the frame once avoids both problems:
hour = -1
rows = []
for i in range(24):
    hour = hour + 1
    rows.append([checkHour6(hour), checkHour7(hour), checkHour8(hour),
                 checkHour9(hour), checkHour10(hour)])

df = pd.DataFrame(rows, columns=['hour_6', 'hour_7', 'hour_8', 'hour_9', 'hour_10'])
There may be a better solution to the whole process, though:
start off with a data column (what hour it is),
then all the other comparisons can be derived from that.
import pandas as pd

df = pd.DataFrame(range(24), columns=['data'])
for time in range(6,11):
    df[f'hour_{time}'] = df['data'] % 24 == time
df = df.astype(int)
If you want you can remove the data column later.
data hour_6 hour_7 hour_8 hour_9 hour_10
0 0 0 0 0 0 0
1 1 0 0 0 0 0
2 2 0 0 0 0 0
3 3 0 0 0 0 0
4 4 0 0 0 0 0
5 5 0 0 0 0 0
6 6 1 0 0 0 0
7 7 0 1 0 0 0
8 8 0 0 1 0 0
9 9 0 0 0 1 0
10 10 0 0 0 0 1
11 11 0 0 0 0 0
12 12 0 0 0 0 0
13 13 0 0 0 0 0
14 14 0 0 0 0 0
15 15 0 0 0 0 0
16 16 0 0 0 0 0
17 17 0 0 0 0 0
18 18 0 0 0 0 0
19 19 0 0 0 0 0
20 20 0 0 0 0 0
21 21 0 0 0 0 0
22 22 0 0 0 0 0
23 23 0 0 0 0 0
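For example, dropping the helper column afterwards:
df = df.drop(columns=['data'])  # keep only the hour_* indicator columns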
Because the object model in numpy and pandas differs from general Python, consider avoiding building objects in a loop as you would with simpler iterables like list or dict.
In fact, your setup can be handled simply with DataFrame.pivot on a column of 24 sequential integers, without any function or loop! You can easily return more hour columns (i.e., hour_0-hour_23), or reindex for your needed five columns:
Data
df = (pd.DataFrame({'hour': ['hour' for _ in range(24)]})
        .assign(hour = lambda x: x['hour'] + '_' + pd.Series(range(24)).astype('str'),
                num = 1)
     )

df.head(5)
# hour num
# 0 hour_0 1
# 1 hour_1 1
# 2 hour_2 1
# 3 hour_3 1
# 4 hour_4 1
Pivot
pvt_df = (df.pivot(columns='hour', values='num')
            .fillna(0)
            .reindex(['hour_6', 'hour_7', 'hour_8', 'hour_9', 'hour_10'], axis='columns')
          )
pvt_df
# hour hour_6 hour_7 hour_8 hour_9 hour_10
# 0 0.0 0.0 0.0 0.0 0.0
# 1 0.0 0.0 0.0 0.0 0.0
# 2 0.0 0.0 0.0 0.0 0.0
# 3 0.0 0.0 0.0 0.0 0.0
# 4 0.0 0.0 0.0 0.0 0.0
# 5 0.0 0.0 0.0 0.0 0.0
# 6 1.0 0.0 0.0 0.0 0.0
# 7 0.0 1.0 0.0 0.0 0.0
# 8 0.0 0.0 1.0 0.0 0.0
# 9 0.0 0.0 0.0 1.0 0.0
# 10 0.0 0.0 0.0 0.0 1.0
# 11 0.0 0.0 0.0 0.0 0.0
# 12 0.0 0.0 0.0 0.0 0.0
# 13 0.0 0.0 0.0 0.0 0.0
# 14 0.0 0.0 0.0 0.0 0.0
# 15 0.0 0.0 0.0 0.0 0.0
# 16 0.0 0.0 0.0 0.0 0.0
# 17 0.0 0.0 0.0 0.0 0.0
# 18 0.0 0.0 0.0 0.0 0.0
# 19 0.0 0.0 0.0 0.0 0.0
# 20 0.0 0.0 0.0 0.0 0.0
# 21 0.0 0.0 0.0 0.0 0.0
# 22 0.0 0.0 0.0 0.0 0.0
# 23 0.0 0.0 0.0 0.0 0.0