I am reading a data frame in from a CSV file and am trying to create a time graph of when the tickets were issued by the frequency of tickets issued. The column containing the times is formatted as hours with a letter indicating am or pm, e.g. 1200A. Because of this, when I try sorting the data frame in ascending order, only the numerical value is considered and the A/P is disregarded. How can I sort the index of my data frame so that it considers the A and P?
I have tried using the sort_index function, and this works, but only in sorting the numbers.
from matplotlib import pyplot as plt
import pandas as pd

tickets = pd.read_csv("./Parking_Violations_Issued_-_Fiscal_Year_2019.csv")
d2 = tickets['Violation Time'].value_counts()
df2 = d2.sort_index(ascending=True, sort_remaining=True)
Sample dataset:
Index Violation Time
.847A 1
0000A 801
0000P 22
0001A 545
0001P 1
0002A 499
0003A 520
0004A 498
0004P 1
0005A 619
0006A 983
0007A 993
0008A 1034
0008P 1
0009A 1074
Original CSV link
This will do the job.
Explanation:
First, I converted each value in your time column into a tuple, like [('.847', 'A'), ('0000', 'A'), ('0001', 'A'), ...].
Next, I sorted according to your logic, i.e., by the second element of the tuple ('A'/'P') and then by the first element (the numbers), and joined the tuples back into their original form.
Finally, I merged with the original dataset to get the required output.
Code:
>>> tickets # Assuming your initial dataframe looks like below, as mentioned in OP
Index Violation Time
0 .847A 1
1 0000A 801
2 0000P 22
3 0001A 545
4 0001P 1
5 0002A 499
6 0003A 520
7 0004A 498
8 0004P 1
9 0005A 619
10 0006A 983
11 0007A 993
12 0008A 1034
13 0008P 1
>>> final_df = pd.DataFrame(["".join(i) for i in sorted(tickets.apply(lambda x: (x['Index'][:-1], x['Index'][-1]), axis=1), key=lambda x: (x[1], x[0]))])
>>> final_df.rename(columns={0:'Index'}, inplace=True)
Output:
>>> final_df.merge(tickets)
Index Violation Time
0 .847A 1
1 0000A 801
2 0001A 545
3 0002A 499
4 0003A 520
5 0004A 498
6 0005A 619
7 0006A 983
8 0007A 993
9 0008A 1034
10 0009A 1074
11 0000P 22
12 0001P 1
13 0004P 1
14 0008P 1
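From here, if the end goal is the frequency graph, a minimal sketch might be (note that in this frame the count column is named 'Violation Time'):
merged = final_df.merge(tickets)
merged.plot(x='Index', y='Violation Time')  # ticket counts, in A-then-P time order
plt.show()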
I would consider writing an algorithm to parse the time strings into the sorting order you would like.
If indeed every Violation Time has an A or P as the last character, you could create a new sorting column which parses the time string into a datetime object. Depending on how dirty the data is, you may have to add some additional parsing checks for the hour and minute substrings, but here is a good start:
EDIT: I added in checks for length and string type to ensure the string is parseable before parsing.
from datetime import datetime
import pandas as pd

def parseDateTime(x, tformat='%I%M%p'):
    if pd.isnull(x):
        return None
    if type(x) is str and len(x) == 5:
        if x[0:2].isdigit() and x[2:4].isdigit():
            newString = str(x).strip() + 'M'  # e.g. '0930A' -> '0930AM'
            try:
                return datetime.strptime(newString, tformat)
            except ValueError:
                # e.g. '0000A': hour 00 is not valid on the 12-hour clock (%I)
                return None
    return None
Note that without date information, the times will all be treated as being on the same day.
Now, you can apply this function to the column and then use the new parsed column for your sorting purposes.
tickets['Violation Time Parsed'] = tickets['Violation Time'].apply(parseDateTime)
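From there, the sort itself is just a sort_values on the parsed column; a minimal sketch:
tickets_sorted = tickets.sort_values('Violation Time Parsed', na_position='last')  # unparseable rows go last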
Recently I've been struggling to read a CSV file with pandas' pd.read_csv.
The problem is that in the CSV file a comma is used both as decimal point and as separator for columns.
The CSV looks as follows:
wavelength,intensity
390,0,382
390,1,390
390,2,400
390,3,408
390,4,418
390,5,427
390,6,437
390,7,447
390,8,457
390,9,468
Pandas accordingly always splits the data into three separate columns. However, the first comma is only the decimal point.
I want to plot the wavelength on the x-axis as 390.0, 390.1, 390.2 nm, and so on.
I must somehow tell pandas that the first comma in each line is the decimal point and the second one is the separator.
How do I do this?
I'm not sure that this is possible. It almost is, as you can see by the following example:
>>> pd.read_csv('test.csv', engine='python', sep=r',(?!\d+$)')
wavelength intensity
0 390 0,382
1 390 1,390
2 390 2,400
3 390 3,408
4 390 4,418
5 390 5,427
6 390 6,437
7 390 7,447
8 390 8,457
9 390 9,468
...but the wrong comma is being split. I'll keep trying to see if it's possible ;)
Meanwhile, a simple solution would be to take advantage of the fact that pandas puts part of the first column in the index (each data row has one more field than the header, so the first field becomes the index):
df = (pd.read_csv('test.csv')
.reset_index()
.assign(wavelength=lambda x: x['index'].astype(str) + '.' + x['wavelength'].astype(str))
.drop('index', axis=1)
.astype({'wavelength': float}))
Output:
>>> df
wavelength intensity
0 390.0 382
1 390.1 390
2 390.2 400
3 390.3 408
4 390.4 418
5 390.5 427
6 390.6 437
7 390.7 447
8 390.8 457
9 390.9 468
EDIT: It is possible!
The following regular expression with a little dropna column-wise gets it done:
df = pd.read_csv('test.csv', engine='python', sep=r',(!?\w+)$').dropna(axis=1, how='all')
Output:
>>> df
wavelength intensity
0 390,0 382
1 390,1 390
2 390,2 400
3 390,3 408
4 390,4 418
5 390,5 427
6 390,6 437
7 390,7 447
8 390,8 457
9 390,9 468
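If you also want the wavelength as a proper float rather than the string '390,0', one more step converts the decimal comma (a sketch, using the df from the EDIT above):
df['wavelength'] = df['wavelength'].str.replace(',', '.').astype(float)  # '390,0' -> 390.0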
I've created a new column in a DataFrame that contains the categorical feature 'QD', which describes in which "decile" (the lowest 10%, 20%, 30% of values) the value of another feature of the DataFrame falls. You can see the DF head below:
EPS CPI POC Vendeu Delta QD
1 20692 1 19185.30336 0 -1506.69664 QD07
8 20933 1 20433.27115 0 -499.72885 QD08
10 20393 1 20808.04948 0 415.04948 QD10
18 20503 1 19153.45978 0 -1349.54022 QD07
19 20587 1 20175.31906 1 -411.68094 QD09
The 'QD' column was created through the function below:
minimo = DF['EPS'].min()
passo = (DF['EPS'].max() - DF['EPS'].min())/10

def get_q(value):
    for i in range(1, 11):
        if value < (minimo + (i*passo)):
            return str('Q' + str(i).zfill(2))
The function was applied to the 'Delta' column.
Analyzing this column, I noticed something strange:
AUX2['QD'].unique()
out:
array(['QD07', 'QD08', 'QD10', 'QD09', 'QD06', 'QD05', 'QD04', 'QD03',
'QD02', 'QD01', None], dtype=object)
The .unique() method returns an array with a None value in it. At first I thought there was something wrong with the function, but then, when I tried to grab the position of the None value, look:
AUX2['QD'].value_counts()
out:
QD05 852
QD04 848
QD06 685
QD03 578
QD07 540
QD08 377
QD02 318
QD09 209
QD10 68
QD01 61
Name: QD, dtype: int64
len(AUX2[AUX2['QD'] == None]['QD'])
out:
0
What am I missing here?
When you are using .value_counts(), add dropna=False so the missing values show up in the counts. Also, comparing with == None finds nothing because missing values don't compare equal element-wise; use .isnull() instead:
df[df['name column'].isnull()]
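A quick sketch against the AUX2 frame from the question:
AUX2['QD'].value_counts(dropna=False)  # counts now include the None/NaN bucket
AUX2[AUX2['QD'].isnull()]              # the rows where 'QD' is missing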
I have a dataset:
A B C D yearweek
0 245 95 60 30 2014-48
1 245 15 70 25 2014-49
2 150 275 385 175 2014-50
3 100 260 170 335 2014-51
4 580 925 535 2590 2015-02
5 630 126 485 2115 2015-03
6 425 90 905 1085 2015-04
7 210 670 655 945 2015-05
How to plot each value against 'yearweek'?
I tried for example:
import matplotlib.pyplot as plt
import pandas as pd
new = pd.DataFrame([df['A'].values, df['yearweek'].values])
plt.plot(new)
but it doesn't work and shows
ValueError: could not convert string to float: '2014-48'
Then I tried this:
plt.scatter(df['Total'], df['yearweek'])
turns out:
ValueError: could not convert string to float: '2015-37'
Does this mean the type of yearweek has some problem? How can I fix it?
Or is it possible to change the index into a date?

The best solution I see is to calculate the date from scratch and add it to a new column as a datetime. Then you can plot it easily.
import datetime

df['date'] = df['yearweek'].map(
    lambda x: datetime.datetime.strptime(x, "%Y-%W")
    + datetime.timedelta(days=7 * (int(x.split('-')[1]) - 1)))
df.plot('date', 'A')
So I start with the first of January of the given year (strptime ignores %W when no weekday is supplied) and go forward 7*(week-1) days, then generate the date from it.
As of pandas 0.20.x, you can use DataFrame.plot() to generate your required plots. It uses matplotlib under the hood:
import pandas as pd
data = pd.read_csv('Your_Dataset.csv')
data.plot(['yearweek'], ['A'])
Here, yearweek becomes the x-axis and A the y-axis. Since these are lists, you can pass multiple columns in both cases.
Note: if it still doesn't look good, you could parse the yearweek column into a proper date format and try again; a possible sketch follows.
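One way to do that parsing (appending a weekday so that %W is honored; assumes all the year-week strings are well formed):
data['date'] = pd.to_datetime(data['yearweek'] + '-1', format='%Y-%W-%w')  # '-1' = Monday
data.plot('date', 'A')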
I have this code and I want to remove the 'timestamp' column from the file u.data, but I can't. It shows the error
"ValueError: labels ['timestamp'] not contained in axis"
How can I correct it?
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
plt.rc("font", size=14)
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Ridge
from sklearn.cross_validation import KFold
from sklearn.cross_validation import train_test_split
data = pd.read_table('u.data')
data.columns=['userID', 'itemID','rating', 'timestamp']
data.drop('timestamp', axis=1)
N = len(data)
print data.shape
print list(data.columns)
print data.head(10)
One of the biggest problems, and one that often goes unnoticed, is that in the u.data file the header you insert must be separated exactly the same way as a row of data. For example, if a tab separates the items of a row, you should not use spaces in the header. When adding headers to your u.data file, separate them with exactly the same whitespace that is used between the items of a row.
PS: Use Sublime Text; Notepad/Notepad++ sometimes does not work.
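If you'd rather not trust an editor, a small sketch that prepends a tab-separated header line (assuming u.data is tab-separated):
with open('u.data') as f:
    body = f.read()
with open('u.data', 'w') as f:
    f.write('userID\titemID\trating\ttimestamp\n' + body)  # header separator must match the rows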
"ValueError: labels ['timestamp'] not contained in axis"
You don't have headers in the file, so the way you loaded it you got a df where the column names are the first row of data. You tried to access column 'timestamp', which doesn't exist.
Your u.data doesn't have headers in it
$head u.data
196 242 3 881250949
186 302 3 891717742
So working with column names isn't going to be possible unless you add the headers. You can add the headers to the file u.data, e.g. I opened it in a text editor and added the line a b c timestamp at the top of it (this seems to be a tab-separated file, so be careful when adding the header not to use spaces, else it breaks the format).
$head u.data
a b c timestamp
196 242 3 881250949
186 302 3 891717742
Now your code works and data.columns returns
Index([u'a', u'b', u'c', u'timestamp'], dtype='object')
And the rest of the trace of your working code is now
(100000, 4) # the shape
['a', 'b', 'c', 'timestamp'] # the columns
a b c timestamp # the df
0 196 242 3 881250949
1 186 302 3 891717742
2 22 377 1 878887116
3 244 51 2 880606923
4 166 346 1 886397596
5 298 474 4 884182806
6 115 265 2 881171488
7 253 465 5 891628467
8 305 451 3 886324817
9 6 86 3 883603013
If you don't want to add headers
Or you can drop the 'timestamp' column by its position (presumably 3). We can do this with df.iloc below: it selects all rows and the columns at positions 0 through 2, thus dropping the column at position 3.
data.iloc[:, 0:3]
I would do it this way:
data = pd.read_table('u.data', header=None,
                     names=['userID', 'itemID', 'rating', 'timestamp'],
                     usecols=['userID', 'itemID', 'rating'])
Check:
In [589]: data.head()
Out[589]:
userID itemID rating
0 196 242 3
1 186 302 3
2 22 377 1
3 244 51 2
4 166 346 1
Suppose I have the following rows in a Pandas DataFrame:
970 P-A1-1019-03-C15,15 23987896 1 8
971 P-A1-1019-06-B15,15 23251711 4 8
972 P-A1-1019-08-C15,15 12160034 2 8
973 P-A1-1020-01-D15,15 8760012 1 8
I'd like to alter the second column to remove the ",15" from the string. Desired end state would be like this:
970 P-A1-1019-03-C15 23987896 1 8
971 P-A1-1019-06-B15 23251711 4 8
972 P-A1-1019-08-C15 12160034 2 8
973 P-A1-1020-01-D15 8760012 1 8
The thing to remove won't always be ",15", as it could be ",10", ",03", ",4", etc. Additionally, some rows in the input are differently formatted, and may look like this:
4 RR00-0,2020338 24380076 4 12
5 RR00-0,2020738 10562767 2 12
6 ,D 24260808 1 12
7 ,D 23521158 1 12
Initially, I'm only interested in the cases where the string DOES fit the form of "P-A1-1019-03-C15", so it would be nice to be able to drop rows which don't match that specific format.
Is there a built-in way to do this kind of processing, or will I need to iterate over every row manually?
This should remove all ',15' suffixes:
dataframe['string column'] = dataframe['string column'].apply(lambda value: value.split(',')[0] if value.split(',')[-1] == '15' else value)
This should remove the ',15' suffixes only if the string is in the format you provided:
dataframe['string column'] = dataframe['string column'].apply(lambda value: value.split(',')[0] if (value.split(',')[-1] == '15') and ('P-A1-' in value) else value)
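For the general case (any ',<digits>' suffix, dropping rows that don't match the 'P-A1-...' shape), pandas' vectorized string methods avoid row-by-row iteration; a sketch, assuming the column is named 'code':
keep = dataframe['code'].str.match(r'P-A1-\d{4}-\d{2}-[A-Z]\d{2}')  # rows in the expected format
dataframe = dataframe[keep].copy()
dataframe['code'] = dataframe['code'].str.replace(r',\d+$', '', regex=True)  # strips ',15', ',03', ',4', ...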