I have this code, and I want to remove the column 'timestamp' from the file u.data but can't. It shows the error
"ValueError: labels ['timestamp'] not contained in axis"
How can I correct it?
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
plt.rc("font", size=14)
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Ridge
from sklearn.cross_validation import KFold
from sklearn.cross_validation import train_test_split
data = pd.read_table('u.data')
data.columns=['userID', 'itemID','rating', 'timestamp']
data.drop('timestamp', axis=1)
N = len(data)
print data.shape
print list(data.columns)
print data.head(10)
A common problem that often goes unnoticed is that, when you insert headers into the u.data file, the separator between the header names must be exactly the same as the separator between the values in a data row. For example, if a tab separates the values, do not use spaces. Add the headers to your u.data file and separate them with exactly the same whitespace as is used between the items of a row.
PS: Use Sublime Text; Notepad/Notepad++ sometimes does not work.
"ValueError: labels ['timestamp'] not contained in axis"
You don't have headers in the file, so the way you loaded it you got a DataFrame whose column names are taken from the first row of the data. You then tried to access a column named timestamp, which doesn't exist.
Your u.data doesn't have headers in it
$head u.data
196 242 3 881250949
186 302 3 891717742
So working with column names isn't going to be possible unless you add the headers. You can add the headers to the file u.data, e.g. I opened it in a text editor and added the line a b c timestamp at the top of it (this seems to be a tab-separated file, so be careful when adding the header not to use spaces, or else it breaks the format).
$head u.data
a b c timestamp
196 242 3 881250949
186 302 3 891717742
Now your code works and data.columns returns
Index([u'a', u'b', u'c', u'timestamp'], dtype='object')
And the rest of the trace of your working code is now
(100000, 4) # the shape
['a', 'b', 'c', 'timestamp'] # the columns
a b c timestamp # the df
0 196 242 3 881250949
1 186 302 3 891717742
2 22 377 1 878887116
3 244 51 2 880606923
4 166 346 1 886397596
5 298 474 4 884182806
6 115 265 2 881171488
7 253 465 5 891628467
8 305 451 3 886324817
9 6 86 3 883603013
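If you don't want to add headers, you can instead drop the column 'timestamp' by its position (presumably 3). The selection below takes all rows and the columns at positions 0 to 2, thus dropping the column at position 3. (The original answer used .ix, which was removed in pandas 1.0; .iloc is the positional replacement, and its end bound is exclusive.)
data.iloc[:, 0:3]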
I would do it this way:
data = pd.read_table('u.data', header=None,
names=['userID', 'itemID','rating', 'timestamp'],
usecols=['userID', 'itemID','rating']
)
Check:
In [589]: data.head()
Out[589]:
userID itemID rating
0 196 242 3
1 186 302 3
2 22 377 1
3 244 51 2
4 166 346 1
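For comparison, here is a minimal sketch of the drop-based route. It also shows a detail that is easy to miss in the question's code: drop returns a new DataFrame, so the result has to be assigned back. The two sample rows are taken from the file excerpt above and read from an in-memory string, which is an assumption made only so the snippet runs without u.data.

```python
import io
import pandas as pd

# Two rows from u.data, tab-separated, no header line
# (in-memory stand-in for the real file; an assumption for illustration)
sample = "196\t242\t3\t881250949\n186\t302\t3\t891717742\n"

data = pd.read_csv(io.StringIO(sample), sep="\t", header=None,
                   names=["userID", "itemID", "rating", "timestamp"])

# drop() returns a new DataFrame; assign it back to keep the change
data = data.drop("timestamp", axis=1)

print(list(data.columns))
print(data.shape)
```

With usecols, as in the answer above, the column is never loaded at all, which is probably the better choice for large files.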
Recently I've been struggling to read a CSV file with pandas pd.read_csv.
The problem is that in the CSV file a comma is used both as the decimal point and as the column separator.
The csv looks as follows:
wavelength,intensity
390,0,382
390,1,390
390,2,400
390,3,408
390,4,418
390,5,427
390,6,437
390,7,447
390,8,457
390,9,468
Pandas accordingly always splits the data into three separate columns. However the first comma is only the decimal point.
I want to plot it with the wavelength (x-axis) with 390.0, 390.1, 390.2 nm and so on.
I must somehow tell pandas that the first comma in each line is the decimal point and the second one is the separator.
How do I do this?
Best
I'm not sure that this is possible. It almost is, as you can see by the following example:
>>> pd.read_csv('test.csv', engine='python', sep=r',(?!\d+$)')
wavelength intensity
0 390 0,382
1 390 1,390
2 390 2,400
3 390 3,408
4 390 4,418
5 390 5,427
6 390 6,437
7 390 7,447
8 390 8,457
9 390 9,468
...but the wrong comma is being split. I'll keep trying to see if it's possible ;)
Meanwhile, a simple solution would be to take advantage of the fact that pandas puts part of the first column in the index:
df = (pd.read_csv('test.csv')
.reset_index()
.assign(wavelength=lambda x: x['index'].astype(str) + '.' + x['wavelength'].astype(str))
.drop('index', axis=1)
.astype({'wavelength': float}))
Output:
>>> df
wavelength intensity
0 390.0 382
1 390.1 390
2 390.2 400
3 390.3 408
4 390.4 418
5 390.5 427
6 390.6 437
7 390.7 447
8 390.8 457
9 390.9 468
EDIT: It is possible!
The following regular expression with a little dropna column-wise gets it done:
df = pd.read_csv('test.csv', engine='python', sep=r',(!?\w+)$').dropna(axis=1, how='all')
Output:
>>> df
wavelength intensity
0 390,0 382
1 390,1 390
2 390,2 400
3 390,3 408
4 390,4 418
5 390,5 427
6 390,6 437
7 390,7 447
8 390,8 457
9 390,9 468
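If the regular expression feels fragile, another option is to split each line at its last comma in plain Python before building the DataFrame. This is only a sketch, assuming every data row has exactly one decimal comma in the first field and none in the second. The helper name split_last_comma and the in-memory sample are made up for illustration.

```python
import pandas as pd

# First rows of the file from the question, held in a string for the example
raw = "wavelength,intensity\n390,0,382\n390,1,390\n390,2,400\n"

def split_last_comma(line):
    """Split at the last comma and turn the remaining comma into a decimal point."""
    left, right = line.rsplit(",", 1)
    return left.replace(",", "."), right

lines = raw.strip().splitlines()
header = lines[0].split(",")
rows = [split_last_comma(line) for line in lines[1:]]

# Fields are still strings at this point, so convert them explicitly
df = pd.DataFrame(rows, columns=header).astype({"wavelength": float, "intensity": int})
print(df)
```

Unlike the regex version, this yields a numeric wavelength column directly, so it can be used as the x-axis without further conversion.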
I am reading a data frame from a CSV file and trying to plot the number of tickets issued against the time of day they were issued. The time column uses a four-digit hour-and-minute format followed by a letter indicating AM or PM, e.g. 1200A. Because of this, when I sort the data frame in ascending order, only the numerical value is considered and the A/P suffix is disregarded. How can I sort the index of my data frame so that A and P are taken into account?
I have tried using the sort_index function, and this works, but only for sorting the numbers:
from matplotlib import pyplot as plt
import pandas as pd
tickets = pd.read_csv("./Parking_Violations_Issued_-_Fiscal_Year_2019.csv")
d2=tickets['Violation Time'].value_counts()
df2=d2.sort_index(ascending=1, sort_remaining='true')
Sample dataset:
Index Violation Time
.847A 1
0000A 801
0000P 22
0001A 545
0001P 1
0002A 499
0003A 520
0004A 498
0004P 1
0005A 619
0006A 983
0007A 993
0008A 1034
0008P 1
0009A 1074
This will do the job.
Explanation:
First, I converted your time column to tuples, like [('.847', 'A'), ('0000', 'A'), ('0001', 'A') ...
Next, I sorted them according to your logic, i.e. by the second element of each tuple ('A', 'P') and then by the first element (the numbers), and joined the tuples to get back the original strings.
Finally, I merged the result with the original dataset to get the required output.
Code:
>>> tickets # Assuming your initial dataframe looks like below, as mentioned in OP
Index Violation Time
0 .847A 1
1 0000A 801
2 0000P 22
3 0001A 545
4 0001P 1
5 0002A 499
6 0003A 520
7 0004A 498
8 0004P 1
9 0005A 619
10 0006A 983
11 0007A 993
12 0008A 1034
13 0008P 1
>>> final_df = pd.DataFrame(["".join(i) for i in sorted(tickets.apply(lambda x: (x['Index'][:-1], x['Index'][-1]), axis=1), key=lambda x : (x[1], x[0]))])
>>> final_df.rename(columns={0:'Index'}, inplace=True)
Output:
>>> final_df.merge(tickets)
Index Violation Time
0 .847A 1
1 0000A 801
2 0001A 545
3 0002A 499
4 0003A 520
5 0004A 498
6 0005A 619
7 0006A 983
8 0007A 993
9 0008A 1034
10 0009A 1074
11 0000P 22
12 0001P 1
13 0004P 1
14 0008P 1
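The same ordering can probably be reached more directly by sorting the index labels with a key and reindexing, without the merge step. A minimal sketch on a few of the sample values, assuming every label ends in A or P and the labels are unique, as they are after value_counts. Note that this is still a plain string sort within each half of the day, so 12xx labels come after 11xx.

```python
import pandas as pd

# A handful of the counts from the question, indexed by the time label
counts = pd.Series({"0000A": 801, "0000P": 22, "0001A": 545,
                    "0001P": 1, "0002A": 499})

# Sort key: AM/PM letter first, then the digits
order = sorted(counts.index, key=lambda label: (label[-1], label[:-1]))
sorted_counts = counts.loc[order]

print(sorted_counts)
```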
I would consider writing an algorithm to parse the time strings into the sorting order you would like.
If indeed every Violation Time has an A or P at the last character, you could create a new sorting column which parses the time string into a datetime object. Depending on how dirty the data is, you will have to add some additional parsing checks for the hour and minute substrings, but here is a good start:
EDIT: I added in checks for length and string type to ensure the string is parseable before parsing.
from datetime import datetime
import pandas as pd
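def parseDateTime(x, tformat='%I%M%p'):
    # Missing values cannot be parsed
    if pd.isnull(x):
        return None
    # Expect exactly five characters: HHMM followed by A or P
    if type(x) is str and len(x) == 5:
        if x[0:2].isdigit() and x[2:4].isdigit():
            newString = str(x).strip() + 'M'
            try:
                parsedDateTime = datetime.strptime(newString, tformat)
            except ValueError:
                # e.g. hour '00', which %I (01-12) rejects
                return None
            return parsedDateTime
    return None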
Note that without date information, the times will all be treated as being on the same day.
Now, you can apply this function to the column and then use the new parsed column for your sorting purposes.
tickets['Violation Time Parsed'] = tickets['Violation Time'].apply(parseDateTime)
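One caveat: %I only accepts hours 01 to 12, so a label such as 0000A cannot be parsed by strptime with this format. As an alternative sketch, the labels can be converted to minutes since midnight by hand. This assumes an hour of 00 means the same as 12, which is a guess about the data, not something the dataset documents. The helper name to_minutes is hypothetical.

```python
def to_minutes(label):
    """Convert 'HHMMA' / 'HHMMP' to minutes since midnight, or None if malformed."""
    if not isinstance(label, str) or len(label) != 5:
        return None
    digits, half = label[:4], label[4]
    if not digits.isdigit() or half not in ("A", "P"):
        return None
    hour, minute = int(digits[:2]), int(digits[2:])
    if hour > 12 or minute > 59:
        return None
    # Assumption: hour 00 is treated like 12 (both become 0 before the PM offset)
    return (hour % 12) * 60 + minute + (720 if half == "P" else 0)

labels = [".847A", "0000A", "0130P", "1200A", "1200P"]
# Malformed labels are skipped; the rest are ordered chronologically
valid = [label for label in labels if to_minutes(label) is not None]
print(sorted(valid, key=to_minutes))
```

Applied with tickets['Violation Time'].apply(to_minutes), this gives an integer column that sorts chronologically and keeps the 00xx rows.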
I am trying to read this small data file,
Link - https://drive.google.com/open?id=1nAS5mpxQLVQn9s_aAKvJt8tWPrP_DUiJ
I am using the code -
df = pd.read_table('/Data/123451_date.csv', sep=';', index_col=0, engine='python', error_bad_lines=False)
It uses ';' as the separator, and some rows have missing values in some columns.
How can I read it properly? The DataFrame I currently get is not loaded correctly.
It looks like the data you use has some garbage in it. Specifically, rows 1-33 (inclusive) contain additional, unnecessary (non-GPS) information. You can either fix the file by manually removing the unneeded information, or use the following code snippet to skip the rows that contain it:
from pandas import read_table
data = read_table('34_2017-02-06.gpx.csv', sep=';', skiprows=list(range(1, 34))).drop("Unnamed: 28", axis=1)
The drop("Unnamed: 28", axis=1) is simply there to remove an additional column that is created probably due to each row in your datasheet ending with a ; (because it reads the empty space at the end of each line as data).
The result of print(data.head()) is then as follows:
index cumdist ele ... esttotalpower lat lon
0 49 340 -34.8 ... 9 52.077362 5.114530
1 51 350 -34.8 ... 17 52.077468 5.114543
2 52 360 -35.0 ... -54 52.077521 5.114551
3 53 370 -35.0 ... -173 52.077603 5.114505
4 54 380 -34.8 ... 335 52.077677 5.114387
[5 rows x 28 columns]
To explain the role of the drop command even more, here is what would happen without it (notice the last, weird column)
index cumdist ele ... lat lon Unnamed: 28
0 49 340 -34.8 ... 52.077362 5.114530 NaN
1 51 350 -34.8 ... 52.077468 5.114543 NaN
2 52 360 -35.0 ... 52.077521 5.114551 NaN
3 53 370 -35.0 ... 52.077603 5.114505 NaN
4 54 380 -34.8 ... 52.077677 5.114387 NaN
[5 rows x 29 columns]
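The trailing-separator effect is easy to reproduce on a tiny example. The sketch below uses a made-up two-column sample in which every line, including the header, ends with ';'. It then drops any column that is entirely empty, so the exact "Unnamed: N" name does not have to be known in advance.

```python
import io
import pandas as pd

# Every line ends with ';', so pandas sees an extra, empty third field
sample = "lat;lon;\n52.077362;5.114530;\n52.077468;5.114543;\n"

raw = pd.read_csv(io.StringIO(sample), sep=";")
print(list(raw.columns))

# Drop columns in which every value is NaN
clean = raw.dropna(axis=1, how="all")
print(list(clean.columns))
```

Be aware that dropna(axis=1, how="all") would also remove a genuine column that happens to be completely empty, so dropping by name is safer when the name is known.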
I have a dataset:
A B C D yearweek
0 245 95 60 30 2014-48
1 245 15 70 25 2014-49
2 150 275 385 175 2014-50
3 100 260 170 335 2014-51
4 580 925 535 2590 2015-02
5 630 126 485 2115 2015-03
6 425 90 905 1085 2015-04
7 210 670 655 945 2015-05
How can I plot each value against 'yearweek'?
I tried for example:
import matplotlib.pyplot as plt
import pandas as pd
new = pd.DataFrame([df['A'].values, df['yearweek'].values])
plt.plot(new)
but it doesn't work and shows
ValueError: could not convert string to float: '2014-48'
Then I tried this:
plt.scatter(df['Total'], df['yearweek'])
turns out:
ValueError: could not convert string to float: '2015-37'
Does this mean there is a problem with the type of yearweek? How can I fix it?
Or is it possible to change the index into dates?

The best solution I see is to calculate the date from scratch and add it as a new datetime column. Then you can plot it easily.
import datetime
df['date'] = df['yearweek'].map(lambda x: datetime.datetime.strptime(x,"%Y-%W")+datetime.timedelta(days=7*(int(x.split('-')[1])-1)))
df.plot('date','A')
So I start from the first of January of the given year and go forward 7*(week-1) days to generate the date.
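The manual week arithmetic is needed because strptime only uses %W when a weekday is also supplied. A possible alternative, sketched below, is to append a weekday to each string so that strptime does the calculation itself. It assumes the week numbers follow the %W convention (weeks start on Monday, and days before the first Monday fall in week 0), which may not match how the data was produced. The name yearweek_to_date is made up for illustration.

```python
from datetime import datetime

def yearweek_to_date(yearweek):
    """Return the Monday of a 'YYYY-WW' week, using the %W week convention."""
    # Appending "-1" (Monday) makes strptime take the week number into account
    return datetime.strptime(yearweek + "-1", "%Y-%W-%w")

for value in ["2014-48", "2015-02"]:
    print(value, yearweek_to_date(value).date())
```

In the DataFrame this would be used as df['date'] = df['yearweek'].map(yearweek_to_date), followed by df.plot('date', 'A').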
As of pandas 0.20.X, you can use DataFrame.plot() to generate your required plots. It uses matplotlib under the hood -
import pandas as pd
data = pd.read_csv('Your_Dataset.csv')
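data.plot(x='yearweek', y=['A'])
Here, yearweek becomes the x-axis and A the y-axis. Because y accepts a list, you can plot several columns at once, e.g. y=['A', 'B']; x has to be a single column label, since recent pandas versions reject a list for x.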
Note: If it still doesn't look good then you could go towards parsing the yearweek column correctly into dateformat and try again.
I have imported data from a csv file and the data looks like this :
user_id movie_id rating ts name year
0 196 242 3 881250949 Kolya (1996) 24-Jan-1997
1 63 242 3 875747190 Kolya (1996) 24-Jan-1997
2 226 242 5 883888671 Kolya (1996) 24-Jan-1997
3 154 242 3 879138235 Kolya (1996) 24-Jan-1997
4 306 242 5 876503793 Kolya (1996) 24-Jan-1997
5 296 242 4 884196057 Kolya (1996) 24-Jan-1997
6 34 242 5 888601628 Kolya (1996) 24-Jan-1997
My code :
import sys
import pandas as pd
df = pd.read_csv(sys.stdin, delimiter='\t')
I am trying to select a column using df['rating'], and it's giving me the above error.
I have also tried df.loc[:,'rating'], which gives me the error
'the label [rating] is not in the [columns]'
When I try to get the column names using print(df.column.values), I get the error
return object.__getattribute__(self, name) AttributeError: 'DataFrame' object has no attribute 'column'
I am not sure how to proceed from here. Any input is appreciated. Thanks.
import sys
import pandas as pd
df = pd.read_csv('your_file.csv')
df.set_index('rating', inplace=True)
The problem is with the parsing. The most plausible scenario is that your input is not really tab-separated (it probably uses multiple spaces instead of tabs).
Try this:
df = pd.read_csv(sys.stdin, sep=' +')
print (df.columns)
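To see what goes wrong, it helps to compare both separators on a small sample. The sketch below uses made-up rows laid out like the numeric part of the question's data, with runs of spaces instead of tabs, and prints df.columns (note the plural; df.column does not exist) for each attempt.

```python
import io
import pandas as pd

# Columns separated by runs of spaces, not tabs (hypothetical sample)
sample = "user_id  movie_id  rating\n196  242  3\n63  242  3\n"

# With a tab delimiter the whole line ends up in a single column,
# so df['rating'] cannot be found
wrong = pd.read_csv(io.StringIO(sample), sep="\t")
print(list(wrong.columns))

# A whitespace regex splits the fields as intended
right = pd.read_csv(io.StringIO(sample), sep=r"\s+")
print(list(right.columns))
print(right["rating"].tolist())
```

Keep in mind that a whitespace separator will also split values that contain spaces themselves, such as the movie names in the question, so this only works cleanly if such fields are quoted or the file really uses a different delimiter.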