Pandas read_csv: decimal and delimiter is the same character - python

I'm struggling to read a CSV file with pandas' pd.read_csv.
The problem is that the file uses a comma both as the decimal point and as the column separator.
The csv looks as follows:
wavelength,intensity
390,0,382
390,1,390
390,2,400
390,3,408
390,4,418
390,5,427
390,6,437
390,7,447
390,8,457
390,9,468
Pandas therefore always splits the data into three separate columns, although the first comma is only the decimal point.
I want to plot the data with the wavelength on the x-axis as 390.0, 390.1, 390.2 nm and so on.
I must somehow tell pandas that the first comma in each line is the decimal point and the second one is the separator.
How do I do this?
Best

I'm not sure that this is possible. It almost is, as you can see by the following example:
>>> pd.read_csv('test.csv', engine='python', sep=r',(?!\d+$)')
wavelength intensity
0 390 0,382
1 390 1,390
2 390 2,400
3 390 3,408
4 390 4,418
5 390 5,427
6 390 6,437
7 390 7,447
8 390 8,457
9 390 9,468
...but the wrong comma is being split. I'll keep trying to see if it's possible ;)
Meanwhile, a simple solution would be to take advantage of the fact that pandas puts part of the first column in the index:
df = (pd.read_csv('test.csv')
.reset_index()
.assign(wavelength=lambda x: x['index'].astype(str) + '.' + x['wavelength'].astype(str))
.drop('index', axis=1)
.astype({'wavelength': float}))
Output:
>>> df
wavelength intensity
0 390.0 382
1 390.1 390
2 390.2 400
3 390.3 408
4 390.4 418
5 390.5 427
6 390.6 437
7 390.7 447
8 390.8 457
9 390.9 468
EDIT: It is possible!
The following regular expression with a little dropna column-wise gets it done:
df = pd.read_csv('test.csv', engine='python', sep=r',(!?\w+)$').dropna(axis=1, how='all')
Output:
>>> df
wavelength intensity
0 390,0 382
1 390,1 390
2 390,2 400
3 390,3 408
4 390,4 418
5 390,5 427
6 390,6 437
7 390,7 447
8 390,8 457
9 390,9 468
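Note that in the output above the wavelength column still holds strings such as 390,0. As a hedged alternative, here is a minimal sketch that skips the regular expression: it splits each line on the last comma only and converts the decimal comma by hand. The helper name read_decimal_comma_csv and the inline sample data are assumptions made for illustration, and the sketch assumes exactly two logical columns with integer intensities.

```python
import io

import pandas as pd

# Inline sample standing in for test.csv (assumption for illustration).
raw = """wavelength,intensity
390,0,382
390,1,390
390,2,400
"""


def read_decimal_comma_csv(handle):
    """Hypothetical helper: the last comma separates the columns,
    any earlier comma is treated as a decimal point."""
    header = handle.readline().strip().split(",")
    rows = []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        # rsplit with maxsplit=1 splits on the last comma only.
        wavelength, intensity = line.rsplit(",", 1)
        rows.append((float(wavelength.replace(",", ".")), int(intensity)))
    return pd.DataFrame(rows, columns=header)


df = read_decimal_comma_csv(io.StringIO(raw))
print(df)
```

With a real file, the same helper could be called on an open file handle instead of the StringIO object.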

Related

How do I split a column based on strings, clean up data, then do calculations on it?

Still learning my way around Python and trying to figure out how to process some data. I've got a dataframe with 1 column that I need to extract into 3 columns of data. I don't need to keep the original column.
Here's the data - "Given Data" is the original column and I want to extract out columns A and B, then do the math for column C (A/B). Thanks for your help!
Try with str.strip and str.split:
df[["A", "B"]] = df["Given Data"].str.strip("()").str.split(" / ", expand=True).astype(int)
df["C"] = df["A"].div(df["B"])
>>> df
Given Data A B C
0 (313 / 321) 313 321 0.975078
1 (654 / 654) 654 654 1.000000
2 (673 / 842) 673 842 0.799287
3 (342 / 402) 342 402 0.850746
4 (586 / 774) 586 774 0.757106
If you want to convert the numeric "C" column to percentage strings, you can do:
df["C"] = df["C"].mul(100).map("{:.2f}%".format)
>>> df
Given Data A B C
0 (313 / 321) 313 321 97.51%
1 (654 / 654) 654 654 100.00%
2 (673 / 842) 673 842 79.93%
3 (342 / 402) 342 402 85.07%
4 (586 / 774) 586 774 75.71%
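If the spacing around the slash is not perfectly regular, a regular expression may be more forgiving than strip plus split. The following is only a sketch under that assumption; it uses str.extract, and the two sample rows are copied from the output above.

```python
import pandas as pd

df = pd.DataFrame({"Given Data": ["(313 / 321)", "(654 / 654)"]})

# Capture the two integers, tolerating any amount of whitespace around "/".
parts = df["Given Data"].str.extract(r"\((\d+)\s*/\s*(\d+)\)").astype(int)
df["A"] = parts[0]
df["B"] = parts[1]
df["C"] = df["A"] / df["B"]
print(df)
```

Rows that do not match the pattern would come back as NaN and make astype(int) fail, so dirty data would need to be filtered first.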

Sort Index of data frame alphabetically

I am reading a data frame from a CSV file and trying to plot the number of tickets issued against the time of issue. The time column holds hours and minutes followed by a letter indicating AM or PM, e.g. 1200A. Because of this, sorting the data frame in ascending order only takes the numeric part into account and ignores the A or P. How can I sort the index of my data frame so that the A and P are considered?
I have tried the sort_index function, and it works, but only for sorting the numbers:
from matplotlib import pyplot as plt
import pandas as pd
tickets = pd.read_csv("./Parking_Violations_Issued_-_Fiscal_Year_2019.csv")
d2=tickets['Violation Time'].value_counts()
df2=d2.sort_index(ascending=1, sort_remaining='true')
Sample dataset:
Index Violation Time
.847A 1
0000A 801
0000P 22
0001A 545
0001P 1
0002A 499
0003A 520
0004A 498
0004P 1
0005A 619
0006A 983
0007A 993
0008A 1034
0008P 1
0009A 1074
Original CSV link
This should do the job.
Explanation:
First, I converted each time string into a tuple, giving [('.847', 'A'), ('0000', 'A'), ('0001', 'A') ...
Next, I sorted the tuples by your logic, i.e. by the second element ('A' or 'P') and then by the first element (the digits), and joined each tuple back into its original string.
Finally, I merged the result with the original dataset to get the required output.
Code:
>>> tickets # Assuming your initial dataframe looks like below, as mentioned in OP
Index Violation Time
0 .847A 1
1 0000A 801
2 0000P 22
3 0001A 545
4 0001P 1
5 0002A 499
6 0003A 520
7 0004A 498
8 0004P 1
9 0005A 619
10 0006A 983
11 0007A 993
12 0008A 1034
13 0008P 1
14 0009A 1074
>>> final_df = pd.DataFrame(["".join(i) for i in sorted(tickets.apply(lambda x: (x['Index'][:-1], x['Index'][-1]), axis=1), key=lambda x : (x[1], x[0]))])
>>> final_df.rename(columns={0:'Index'}, inplace=True)
Output:
>>> final_df.merge(tickets)
Index Violation Time
0 .847A 1
1 0000A 801
2 0001A 545
3 0002A 499
4 0003A 520
5 0004A 498
6 0005A 619
7 0006A 983
8 0007A 993
9 0008A 1034
10 0009A 1074
11 0000P 22
12 0001P 1
13 0004P 1
14 0008P 1
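The same ordering can probably be had without the merge step. Below is a minimal sketch, assuming the counts live in a Series indexed by the time strings (as value_counts would return) and that the labels are unique. The key function am_pm_key is a made-up name for illustration.

```python
import pandas as pd

# A few rows from the sample dataset, indexed by the time string.
counts = pd.Series(
    [801, 22, 545, 1, 499],
    index=["0000A", "0000P", "0001A", "0001P", "0002A"],
    name="Violation Time",
)


def am_pm_key(label):
    # Sort by the trailing letter first ('A' before 'P'), then by the digits.
    return (label[-1], label[:-1])


ordered = counts.loc[sorted(counts.index, key=am_pm_key)]
print(ordered)
```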
I would consider writing an algorithm to parse the time strings into the sorting order you would like.
If indeed every Violation Time has an A or P at the last character, you could create a new sorting column which parses the time string into a datetime object. Depending on how dirty the data is, you will have to add some additional parsing checks for the hour and minute substrings, but here is a good start:
EDIT: I added in checks for length and string type to ensure the string is parseable before parsing.
from datetime import datetime
import pandas as pd
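def parseDateTime(x, tformat='%I%M%p'):
    # Missing values cannot be parsed.
    if pd.isnull(x):
        return None
    # Expect four digits followed by 'A' or 'P', e.g. '1200A'.
    if type(x) is str and len(x) == 5:
        if x[0:2].isdigit() and x[2:4].isdigit():
            newString = str(x).strip() + 'M'
            try:
                parsedDateTime = datetime.strptime(newString, tformat)
            except ValueError:
                # For example, hour '00' is not valid for %I (01-12).
                return None
            return parsedDateTime
    return None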
Note that without date information, the times will all be treated as being on the same day.
Now, you can apply this function to the column and then use the new parsed column for your sorting purposes.
tickets['Violation Time Parsed'] = tickets['Violation Time'].apply(parseDateTime)
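To illustrate the sorting step, here is a small sketch that parses the column in one vectorized call and then sorts on the result. It swaps in pd.to_datetime with errors="coerce" instead of the function above, so unparseable strings become NaT and end up last. The four sample values are made up for illustration.

```python
import pandas as pd

tickets = pd.DataFrame({"Violation Time": ["1230P", "0115A", "0945A", "0000A"]})

# Append 'M' so that 'A'/'P' become 'AM'/'PM', then parse as 12-hour time.
# '0000AM' is not a valid 12-hour time, so it is coerced to NaT.
tickets["Violation Time Parsed"] = pd.to_datetime(
    tickets["Violation Time"] + "M", format="%I%M%p", errors="coerce"
)

# NaT values are placed last by default.
tickets_sorted = tickets.sort_values("Violation Time Parsed")
print(tickets_sorted)
```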

Python plot data against date

I have a dataset:
A B C D yearweek
0 245 95 60 30 2014-48
1 245 15 70 25 2014-49
2 150 275 385 175 2014-50
3 100 260 170 335 2014-51
4 580 925 535 2590 2015-02
5 630 126 485 2115 2015-03
6 425 90 905 1085 2015-04
7 210 670 655 945 2015-05
How to plot each value against 'yearweek'?
I tried for example:
import matplotlib.pyplot as plt
import pandas as pd
new = pd.DataFrame([df['A'].values, df['yearweek'].values])
plt.plot(new)
but it doesn't work and shows
ValueError: could not convert string to float: '2014-48'
Then I tried this:
plt.scatter(df['Total'], df['yearweek'])
turns out:
ValueError: could not convert string to float: '2015-37'
Does this mean there is a problem with the type of yearweek? How can I fix it?
Or is it possible to turn the index into dates?

The best solution I see is to calculate the date from scratch and add it to a new column as a datetime. Then you can plot it easily.
import datetime
df['date'] = df['yearweek'].map(lambda x: datetime.datetime.strptime(x,"%Y-%W")+datetime.timedelta(days=7*(int(x.split('-')[1])-1)))
df.plot('date','A')
So I start from January 1 of the given year and move forward 7*(week-1) days to get the date.
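A possible variant lets strptime do the week arithmetic by appending a weekday to the string. This is only a sketch: it assumes the week numbers follow the %W convention (weeks start on Monday, and days before the first Monday fall in week 0), which may not match how yearweek was generated; ISO weeks would need %G-%V-%u instead.

```python
from datetime import datetime

import pandas as pd

df = pd.DataFrame({
    "A": [245, 245, 580],
    "yearweek": ["2014-48", "2014-49", "2015-02"],
})

# Appending "-1" selects the Monday of each week; %W is only honoured
# by strptime when a weekday is present in the format.
df["date"] = df["yearweek"].map(
    lambda s: datetime.strptime(s + "-1", "%Y-%W-%w")
)
print(df)
```

The resulting date column can then be passed to df.plot('date', 'A') as in the answer above.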
As of pandas 0.20.X, you can use DataFrame.plot() to generate your required plots. It uses matplotlib under the hood -
import pandas as pd
data = pd.read_csv('Your_Dataset.csv')
data.plot(['yearweek'], ['A'])
Here, yearweek becomes the x-axis and A the y-axis. Because y is given as a list, you can name several columns there to plot them together.
Note: If the result still doesn't look good, parse the yearweek column into proper dates first and try again.

find value in column and based on it create a new dataframe in pandas

I have a variable in the following format fg = 2017-20. It's a string. And also I have a dataframe:
flag №
2017-18 389
2017-19 390
2017-20 391
2017-21 392
2017-22 393
2017-23 394
...
I need to find this value (fg) in the column "flag" and select the appropriate value (in the example it will be 391) in the column "№". Then create new dataframe, in which there will also be a column "№". Add this value to this dataframe and iterate 53 times. The result should look like this:
№_new
391
392
393
394
395
...
442
443
444
It does not look difficult, but I cannot find anything suitable in other questions. Can someone advise, please?
You need boolean indexing with loc for filtering, then convert the one-item Series to a scalar by converting it to a NumPy array with values and selecting the first value with [0].
Finally, create the new DataFrame with numpy.arange.
fg = '2017-20'
val = df.loc[df['flag'] == fg, '№'].values[0]
print (val)
391
df1 = pd.DataFrame({'№_new':np.arange(val, val+53)})
print (df1)
№_new
0 391
1 392
2 393
3 394
4 395
5 396
6 397
7 398
8 399
9 400
10 401
11 402
..
..
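Indexing with [0] raises an IndexError when the flag is not present. A minimal self-contained sketch with an explicit check might look like this; the four-row frame is a shortened copy of the question's data, and the error message is an assumption for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "flag": ["2017-18", "2017-19", "2017-20", "2017-21"],
    "№": [389, 390, 391, 392],
})
fg = "2017-20"

# Boolean indexing returns a (possibly empty) Series of matches.
matches = df.loc[df["flag"] == fg, "№"]
if matches.empty:
    raise KeyError(f"flag {fg!r} not found")

val = int(matches.iloc[0])
# 53 consecutive values starting at the matched number.
df1 = pd.DataFrame({"№_new": np.arange(val, val + 53)})
print(df1.head())
```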

"ValueError: labels ['timestamp'] not contained in axis" error

I have this code and I want to remove the column 'timestamp' from the file u.data, but I can't. It shows the error
"ValueError: labels ['timestamp'] not contained in axis"
How can I correct it?
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
plt.rc("font", size=14)
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Ridge
from sklearn.cross_validation import KFold
from sklearn.cross_validation import train_test_split
data = pd.read_table('u.data')
data.columns=['userID', 'itemID','rating', 'timestamp']
data.drop('timestamp', axis=1)
N = len(data)
print data.shape
print list(data.columns)
print data.head(10)
A common problem that often goes unnoticed is that, when inserting headers into the u.data file, the separation between the header names must be exactly the same as the separation between the items of a data row. For example, if a tab separates the values of a row, then you should not use spaces. In your u.data file, add headers and separate them with exactly the same whitespace as is used between the items of a row.
PS: Use Sublime Text; Notepad/Notepad++ sometimes does not work.
"ValueError: labels ['timestamp'] not contained in axis"
You don't have headers in the file, so the way you loaded it you got a df whose column names are the first row of the data. You tried to access the column timestamp, which doesn't exist.
Your u.data doesn't have headers in it
$head u.data
196 242 3 881250949
186 302 3 891717742
So working with column names isn't going to be possible unless you add the headers. You can add the headers to the file u.data, e.g. I opened it in a text editor and added the line a b c timestamp at the top of it (this seems to be a tab-separated file, so be careful when adding the header not to use spaces, else it breaks the format)
$head u.data
a b c timestamp
196 242 3 881250949
186 302 3 891717742
Now your code works and data.columns returns
Index([u'a', u'b', u'c', u'timestamp'], dtype='object')
And the rest of the trace of your working code is now
(100000, 4) # the shape
['a', 'b', 'c', 'timestamp'] # the columns
a b c timestamp # the df
0 196 242 3 881250949
1 186 302 3 891717742
2 22 377 1 878887116
3 244 51 2 880606923
4 166 346 1 886397596
5 298 474 4 884182806
6 115 265 2 881171488
7 253 465 5 891628467
8 305 451 3 886324817
9 6 86 3 883603013
If you don't want to add headers
Or you can drop the column 'timestamp' by its position (presumably 3). With df.iloc (the older df.ix indexer was removed in pandas 1.0), the line below selects all rows and the columns at positions 0 to 2, thus dropping the column at position 3
data.iloc[:, 0:3]
I would do it this way:
data = pd.read_table('u.data', header=None,
names=['userID', 'itemID','rating', 'timestamp'],
usecols=['userID', 'itemID','rating']
)
Check:
In [589]: data.head()
Out[589]:
userID itemID rating
0 196 242 3
1 186 302 3
2 22 377 1
3 244 51 2
4 166 346 1
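For completeness, here is a small sketch of the drop-based route. It assumes u.data is tab-separated without a header row and uses two inline sample rows in place of the file. Note that drop returns a new DataFrame, so the result has to be assigned back.

```python
import io

import pandas as pd

# Two rows standing in for u.data (tab-separated, no header line).
raw = "196\t242\t3\t881250949\n186\t302\t3\t891717742\n"

data = pd.read_csv(
    io.StringIO(raw),
    sep="\t",
    header=None,
    names=["userID", "itemID", "rating", "timestamp"],
)

# drop does not modify the frame in place unless asked to.
data = data.drop(columns="timestamp")
print(data)
```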
