My data file looks like this:
data.txt
user,activity,timestamp,x-axis,y-axis,z-axis
0,33,Jogging,49105962326000,-0.6946376999999999,12.680544,0.50395286;
1,33,Jogging,49106062271000,5.012288,11.264028,0.95342433;
2,33,Jogging,49106112167000,4.903325,10.882658000000001,-0.08172209;
3,33,Jogging,49106222305000,-0.61291564,18.496431,3.0237172;
As can be seen, every value in the last column ends with a semicolon, so when I read the file into pandas, that column is inferred as type object (the values still end with the semicolon).
df = pd.read_csv('data.txt')
df
user activity timestamp x-axis y-axis z-axis
0 33 Jogging 49105962326000 -0.694638 12.680544 0.50395286;
1 33 Jogging 49106062271000 5.012288 11.264028 0.95342433;
2 33 Jogging 49106112167000 4.903325 10.882658 -0.08172209;
3 33 Jogging 49106222305000 -0.612916 18.496431 3.0237172;
How do I make pandas ignore that semicolon?
The problem with your txt file is that it has mixed content: as far as I can see, the header line doesn't end with the semicolon termination character.
If you change the first line by adding the semicolon, it's quite simple:
pd.read_csv("data.txt", lineterminator=";")
This might not fit your real data, but it works for the example shown.
In the docs you can find the comment param, which:
indicates remainder of line should not be parsed. If found at the beginning of a line, the line will be ignored altogether. This parameter must be a single character. Like empty lines (as long as skip_blank_lines=True), fully commented lines are ignored by the parameter header but not by skiprows. For example, if comment='#', parsing #empty\na,b,c\n1,2,3 with header=0 will result in ‘a,b,c’ being treated as the header.
So, as long as ; only appears at the end of your last column:
>>> df = pd.read_csv("data.txt", comment=";")
>>> df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 4 entries, 0 to 3
Data columns (total 6 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 user 4 non-null int64
1 activity 4 non-null object
2 timestamp 4 non-null int64
3 x-axis 4 non-null float64
4 y-axis 4 non-null float64
5 z-axis 4 non-null float64
dtypes: float64(3), int64(2), object(1)
memory usage: 224.0+ bytes
>>> df
user activity timestamp x-axis y-axis z-axis
0 33 Jogging 49105962326000 -0.694638 12.680544 0.503953
1 33 Jogging 49106062271000 5.012288 11.264028 0.953424
2 33 Jogging 49106112167000 4.903325 10.882658 -0.081722
3 33 Jogging 49106222305000 -0.612916 18.496431 3.023717
You can make use of converters param:
to parse your string
to replace ;
to convert to float
df = pd.read_csv('data.txt', sep=",", converters={"z-axis": lambda x: float(x.replace(";",""))})
print(df)
user activity timestamp x-axis y-axis z-axis
0 0 33 Jogging 49105962326000 -0.694638 12.680544 0.503953
1 1 33 Jogging 49106062271000 5.012288 11.264028 0.953424
2 2 33 Jogging 49106112167000 4.903325 10.882658 -0.081722
3 3 33 Jogging 49106222305000 -0.612916 18.496431 3.023717
I am really new to Python; I started learning it just a week ago. I have a query and hope you can help me solve it. Thanks in advance!
I have data in below format.
Date Product Price Discount
1/1/2020 A 17,490 30
1/1/2020 B 34,990 21
1/1/2020 C 20,734 11
1/2/2020 A 16,884 26
1/2/2020 B 26,990 40
1/2/2020 C 17,936 10
1/3/2020 A 16,670 36
1/3/2020 B 12,990 13
1/3/2020 C 30,990 43
I want to take the average of the Discount column for each date and end up with just two columns, but it isn't working out:
Date AVG_Discount
1/1/2020 x %
1/2/2020 y %
1/3/2020 z %
What I have tried is below. As I said, I am a novice in Python, so the approach might be incorrect. I need some guidance. TIA
mean_col=df.groupby(df['time'])['discount'].mean()
df=df.set_index(['time'])
df['mean_col']=mean_col
df=df.reset_index()
df.groupby(df['time'])['discount'].mean() already returns a Series with time as the index.
All you need to do is call reset_index on it:
grouped_df = df.groupby(df['time'])['discount'].mean().reset_index()
As Quang Hoang suggested in the comments, you can also pass as_index=False to groupby.
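For illustration, a minimal self-contained sketch of both variants; the toy data and the column names 'time' and 'discount' are only placeholders taken from your snippet, so match them to your real frame:

import pandas as pd

# toy stand-in for your data
df = pd.DataFrame({
    'time': ['1/1/2020', '1/1/2020', '1/2/2020', '1/2/2020'],
    'discount': [30, 21, 26, 40],
})

# option 1: reset_index after the groupby
out1 = df.groupby('time')['discount'].mean().reset_index()

# option 2: as_index=False keeps 'time' as a regular column from the start
out2 = df.groupby('time', as_index=False)['discount'].mean()

print(out1)
print(out2)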
Apparently, you have read your DataFrame from a text file,
e.g. a CSV, but with a separator other than a comma.
Run df.info() and I assume you got a result something like below:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9 entries, 0 to 8
Data columns (total 4 columns):
Date 9 non-null object
Product 9 non-null object
Price 9 non-null object
Discount 9 non-null int64
dtypes: int64(1), object(3)
Note that the Date, Product and Price columns are of object type
(actually, strings). This remark is especially important in the case of the
Price column, because to compute a mean the source column must be
a number (not a string).
So first you should convert the Date and Price columns to proper types
(datetime and float). To do it, run:
df.Date = pd.to_datetime(df.Date)
df.Price = df.Price.str.replace(',', '.').astype(float)
Run df.info() again and now the result should be:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9 entries, 0 to 8
Data columns (total 4 columns):
Date 9 non-null datetime64[ns]
Product 9 non-null object
Price 9 non-null float64
Discount 9 non-null int64
dtypes: datetime64[ns](1), float64(1), int64(1), object(1)
And now you can compute the mean discount, running:
df.groupby('Date').Discount.mean()
For your data I got:
Date
2020-01-01 20.666667
2020-01-02 25.333333
2020-01-03 30.666667
Name: Discount, dtype: float64
Note that your code sample contains the following errors:
The argument of groupby is the column name (or a list of column names), so:
the df inside the parentheses is not needed,
instead of time you should write Date (you have no time column).
Also, your Discount column name starts with a capital D.
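Putting those fixes together, here is a minimal sketch of your original attempt; the output column name AVG_Discount is just taken from your desired result, nothing about it is required:

# group on the Date column, average Discount, and keep Date as a regular column
mean_per_date = (
    df.groupby('Date', as_index=False)['Discount']
      .mean()
      .rename(columns={'Discount': 'AVG_Discount'})
)
print(mean_per_date)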
I have a CSV file called 'filename' and want to read the data in as float64, except for the column 'hour'. I managed it with the pd.read_csv function and a converter.
df = pd.read_csv("../data/filename.csv",
delimiter = ';',
date_parser = ['hour'],
skiprows = 1,
converters={'column1': lambda x: float(x.replace ('.','').replace(',','.'))})
Now, I have two points:
FIRST:
The delimiter works with ';', but if I look at my data in Notepad, the separators are ',', not ';'. If I use ',' I get: 'pandas.parser.CParserError: Error tokenizing data. C error: Expected 7 fields in line 13, saw 9'
SECOND:
If I want to use the converter for all columns, how can I do that? What's the right syntax?
I tried to use dtype=float in the read function, but I get 'AttributeError: 'NoneType' object has no attribute 'dtype''. What happened? That's the reason why I wanted to manage it with the converter.
Data:
,hour,PV,Wind onshore,Wind offshore,PV.1,Wind onshore.1,Wind offshore.1,PV.2,Wind onshore.2,Wind offshore.2
0,1,0.0,"12,985.0","9,614.0",0.0,"32,825.5","9,495.7",0.0,"13,110.3","10,855.5"
1,2,0.0,"12,908.9","9,290.8",0.0,"36,052.3","9,589.1",0.0,"13,670.2","10,828.6"
2,3,0.0,"12,740.9","8,886.9",0.0,"38,540.9","10,087.3",0.0,"14,610.8","10,828.6"
3,4,0.0,"12,485.3","8,644.5",0.0,"40,734.0","10,087.3",0.0,"15,638.3","10,343.7"
4,5,0.0,"11,188.5","8,079.0",0.0,"42,688.0","10,087.3",0.0,"16,809.4","10,343.7"
5,6,0.0,"11,219.0","7,594.2",0.0,"43,333.5","10,025.0",0.0,"18,266.9","10,343.7"
This should work:
In [40]:
import io
import pandas as pd

# text data
temp=''',hour,PV,Wind onshore,Wind offshore,PV.1,Wind onshore.1,Wind offshore.1,PV.2,Wind onshore.2,Wind offshore.2
0,1,0.0,"12,985.0","9,614.0",0.0,"32,825.5","9,495.7",0.0,"13,110.3","10,855.5"
1,2,0.0,"12,908.9","9,290.8",0.0,"36,052.3","9,589.1",0.0,"13,670.2","10,828.6"
2,3,0.0,"12,740.9","8,886.9",0.0,"38,540.9","10,087.3",0.0,"14,610.8","10,828.6"
3,4,0.0,"12,485.3","8,644.5",0.0,"40,734.0","10,087.3",0.0,"15,638.3","10,343.7"
4,5,0.0,"11,188.5","8,079.0",0.0,"42,688.0","10,087.3",0.0,"16,809.4","10,343.7"
5,6,0.0,"11,219.0","7,594.2",0.0,"43,333.5","10,025.0",0.0,"18,266.9","10,343.7"'''
# so read the csv, pass params quotechar and the thousands character
df = pd.read_csv(io.StringIO(temp), quotechar='"', thousands=',')
df
Out[40]:
Unnamed: 0 hour PV Wind onshore Wind offshore PV.1 Wind onshore.1 \
0 0 1 0 12985.0 9614.0 0 32825.5
1 1 2 0 12908.9 9290.8 0 36052.3
2 2 3 0 12740.9 8886.9 0 38540.9
3 3 4 0 12485.3 8644.5 0 40734.0
4 4 5 0 11188.5 8079.0 0 42688.0
5 5 6 0 11219.0 7594.2 0 43333.5
Wind offshore.1 PV.2 Wind onshore.2 Wind offshore.2
0 9495.7 0 13110.3 10855.5
1 9589.1 0 13670.2 10828.6
2 10087.3 0 14610.8 10828.6
3 10087.3 0 15638.3 10343.7
4 10087.3 0 16809.4 10343.7
5 10025.0 0 18266.9 10343.7
In [41]:
# check the dtypes
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 6 entries, 0 to 5
Data columns (total 11 columns):
Unnamed: 0 6 non-null int64
hour 6 non-null int64
PV 6 non-null float64
Wind onshore 6 non-null float64
Wind offshore 6 non-null float64
PV.1 6 non-null float64
Wind onshore.1 6 non-null float64
Wind offshore.1 6 non-null float64
PV.2 6 non-null float64
Wind onshore.2 6 non-null float64
Wind offshore.2 6 non-null float64
dtypes: float64(9), int64(2)
memory usage: 576.0 bytes
So basically you need to pass the quotechar='"' and thousands=',' params to read_csv to achieve what you want, see the docs: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html#pandas.read_csv
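To read your actual file instead of the inline string, a hedged sketch along the same lines (the path is the one from your own snippet; sep=',' and index_col=0 are assumptions based on the sample rows you posted, which are comma-separated and start with an unnamed index column):

import pandas as pd

# read the comma-separated file, treating the first unnamed column as the index
# and stripping the quoted thousands separators in one go
df = pd.read_csv("../data/filename.csv",
                 sep=',',
                 quotechar='"',
                 thousands=',',
                 index_col=0)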
EDIT
If you want to convert after importing (which is wasteful when you can do it upfront), then you can do this for each column of interest:
In [43]:
# replace the comma separator
df['Wind onshore'] = df['Wind onshore'].str.replace(',','')
# convert the type
df['Wind onshore'] = df['Wind onshore'].astype(np.float64)
df['Wind onshore'].dtype
Out[43]:
dtype('float64')
It would be faster to replace the comma separator on all the columns of interest first and just call convert_objects like so: df.convert_objects(convert_numeric=True)
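Note that convert_objects has since been deprecated (and removed in newer pandas versions); a rough sketch of the same post-import cleanup with current APIs could look like the following, assuming the frame was read without thousands=',' so the numeric columns are still strings and only need the commas stripped:

import pandas as pd

# strip the thousands separator from every string column and convert it to a
# numeric dtype; errors='coerce' turns anything unparseable into NaN
for col in df.select_dtypes(include='object').columns:
    df[col] = pd.to_numeric(df[col].str.replace(',', ''), errors='coerce')

df.info()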
Given a dataframe d such as this:
index col1
1 a
2 a
3 b
4 b
Create a prefiltered group object with new values:
g = d[prefilter].groupby(['some cols']).apply( somefunc )
index col1
2 c
4 d
Now I want to update d to this:
index col1
1 a
2 c
3 b
4 d
I've been hacking away with update, ix, filtering, where, etc. I am guessing there is an obvious solution I am not seeing here.
Stuff like this is not working:
d[d.index == db.index]['alert_v'] = db['alert_v']
q90 = g.transform( somefunc )
d.ix[ d['alert_v'] >=q90, 'alert_v'] = 1
d.ix[ d['alert_v'] < q90, 'alert_v'] = 0
d['alert_v'] = np.where( d.index==db.index, db['alert_v'], d['alert_v'] )
Any help is appreciated.
Thank you.
--edit--
The two dataframes have the same form:
one is simply a filtered version of the other, with different values that I want to write back into the original.
ValueError: cannot reindex from a duplicate axis
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 2186 entries, 1984-12-12 13:33:00 to 1939-03-19 22:54:00
Data columns (total 9 columns):
source 2186 non-null object
subject_id 2186 non-null float64
alert_t 2186 non-null object
variable 2186 non-null object
timeindex 2186 non-null datetime64[ns]
alert_v 2105 non-null float64
value 2186 non-null float64
tavg 54 non-null timedelta64[ns]
iqt 61 non-null object
dtypes: datetime64[ns](1), float64(3), object(4), timedelta64[ns](1)
None

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 1982 entries, 1984-12-12 13:33:00 to 1939-03-19 22:54:00
Data columns (total 9 columns):
source 1982 non-null object
subject_id 1982 non-null float64
alert_t 1982 non-null object
variable 1982 non-null object
timeindex 1982 non-null datetime64[ns]
alert_v 1982 non-null int64
value 1982 non-null float64
tavg 0 non-null timedelta64[ns]
iqt 0 non-null object
dtypes: datetime64[ns](1), float64(2), int64(1), object(4), timedelta64[ns](1)
None
You want the df.update() function.
Try something like this:
import pandas as pd
df1 = pd.DataFrame({'Index':[1,2,3,4],'Col1':['A', 'B', 'C', 'D']}).set_index('Index')
df2 = pd.DataFrame({'Index':[2,4],'Col1':['E', 'F']}).set_index('Index')
print(df1)
Col1
Index
1 A
2 B
3 C
4 D
df1.update(df2)
print(df1)
Col1
Index
1 A
2 E
3 C
4 F
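A few usage notes, with a small self-contained sketch that mirrors the frames in the question (d is the full frame, db the filtered one, names taken from your snippets): update aligns on index and columns, only the rows present in the passed frame are touched, and NaN values in it do not overwrite existing data.

import pandas as pd

# toy stand-ins for the frames in the question
d = pd.DataFrame({'col1': ['a', 'a', 'b', 'b']}, index=[1, 2, 3, 4])
db = pd.DataFrame({'col1': ['c', 'd']}, index=[2, 4])

# update() modifies d in place, aligning the rows of db on the index
d.update(db)
print(d)

Also note that the alignment requires unique index labels; with a DatetimeIndex containing duplicate timestamps, as the info() output in the question suggests, you will most likely keep hitting the 'cannot reindex from a duplicate axis' error until you deduplicate the index or align on a different key.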