I'm using Python 3.4.0 with pandas==0.16.2. I have soccer team results in CSV files with the following columns: date, at, goals.scored, goals.lost, result. The 'at' column can have one of three values (H, A, N), which indicate whether the game took place at the team's home stadium, away, or at a neutral venue, respectively. Here's the head of one such file:
date,at,goals.scored,goals.lost,result
16/09/2014,A,0,2,2
01/10/2014,H,4,1,1
22/10/2014,A,2,1,1
04/11/2014,H,3,3,0
26/11/2014,H,2,0,1
09/12/2014,A,4,1,1
25/02/2015,H,1,3,2
17/03/2015,A,2,0,1
19/08/2014,A,0,0,0
When I load this file into a pandas DataFrame in the usual way:
import pandas as pd
aTeam = pd.DataFrame.from_csv("teamA-results.csv")
the first two columns 'date' and 'at' seem to be treated as one and I get a malformed data frame like this one:
aTeam.dtypes
at object
goals.scored int64
goals.lost int64
result int64
dtype: object
aTeam
at goals.scored goals.lost result
date
2014-09-16 A 0 2 2
2014-01-10 H 4 1 1
2014-10-22 A 2 1 1
2014-04-11 H 3 3 0
2014-11-26 H 2 0 1
...
The code block above does not clearly show the corruption, so I attached a screenshot from the Jupyter notebook:
As you can see, the 'date' and 'at' columns seem to be treated as one column of object type:
aTeam['at']
date
2014-09-16 A
2014-01-10 H
2014-10-22 A
2014-04-11 H
2014-11-26 H
2014-09-12 A
Initially I thought the lack of quotes around the dates was causing this problem, so I added them, but that did not help. I then quoted all the values in the 'at' column as well, which still did not solve it; I tried both single and double quotes in the CSV file. Interestingly, using no quotes or double quotes around the values in 'date' and 'at' produced the same results as above, while single quotes were interpreted as part of the value in the 'at' column but not in the 'date' column.
Adding the parse_dates=True param did not have any effect on the data frame.
I did not have such issues when I was working with these CSV files in R. I would appreciate any help on this one.
I can replicate your issue using from_csv; the issue is that it uses column 0 as the index by default, so passing index_col=None works:
index_col : int or sequence, default 0
Column to use for index. If a sequence is given, a MultiIndex is used. Different default from read_table
import pandas as pd
aTeam = pd.DataFrame.from_csv("in.csv", index_col=None)
Output:
date at goals.scored goals.lost result
0 16/09/2014 A 0 2 2
1 01/10/2014 H 4 1 1
2 22/10/2014 A 2 1 1
3 04/11/2014 H 3 3 0
4 26/11/2014 H 2 0 1
5 09/12/2014 A 4 1 1
6 25/02/2015 H 1 3 2
7 17/03/2015 A 2 0 1
8 19/08/2014 A 0 0 0
Or, using read_csv works correctly and is probably what you wanted, given that you were experimenting with quoting in the file (quotechar is a valid read_csv argument):
import pandas as pd
aTeam = pd.read_csv("in.csv")
Output:
date at goals.scored goals.lost result
0 16/09/2014 A 0 2 2
1 01/10/2014 H 4 1 1
2 22/10/2014 A 2 1 1
3 04/11/2014 H 3 3 0
4 26/11/2014 H 2 0 1
5 09/12/2014 A 4 1 1
6 25/02/2015 H 1 3 2
7 17/03/2015 A 2 0 1
8 19/08/2014 A 0 0 0
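One more note, hedged: the dates in your file look day-first (dd/mm/yyyy), which explains why the from_csv index above shows 2014-01-10 for 01/10/2014. A minimal sketch, assuming the same file, that parses them explicitly (parse_dates and dayfirst are standard read_csv arguments):
import pandas as pd

# dayfirst=True keeps 01/10/2014 as 1 October 2014 rather than 10 January 2014
aTeam = pd.read_csv("teamA-results.csv", parse_dates=["date"], dayfirst=True)
print(aTeam.dtypes)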
I have no problem here (using Python 2.7, pandas 0.16.2). I used the text you pasted to create a .csv file on my desktop and loaded it into pandas using two methods:
import pandas as pd
a = pd.read_csv('test.csv')
b = pd.DataFrame.from_csv('test.csv')
>>> a
date at goals.scored goals.lost result
0 16/09/2014 A 0 2 2
1 01/10/2014 H 4 1 1
2 22/10/2014 A 2 1 1
3 04/11/2014 H 3 3 0
4 26/11/2014 H 2 0 1
5 09/12/2014 A 4 1 1
6 25/02/2015 H 1 3 2
7 17/03/2015 A 2 0 1
8 19/08/2014 A 0 0 0
>>> b
at goals.scored goals.lost result
date
2014-09-16 A 0 2 2
2014-01-10 H 4 1 1
2014-10-22 A 2 1 1
2014-04-11 H 3 3 0
2014-11-26 H 2 0 1
2014-09-12 A 4 1 1
2015-02-25 H 1 3 2
2015-03-17 A 2 0 1
2014-08-19 A 0 0 0
You can see the different behavior between the read_csv and from_csv commands in how the index is handled. However, I do not see anything like the problem you mentioned. You could always try further defining the read_csv parameters, though I doubt that will make a substantial difference.
Could you confirm that your 'date' and 'at' columns are being smashed together by querying the dataframe via aTeam['at'] and seeing what that yields?
Related
I would like to drop the rows containing an empty list ([]) from a given df:
df=pd.DataFrame(dict(a=[1,2,4,[],5]))
The expected output will be:
a
0 1
1 2
2 4
3 5
Edit:
Or, to make things more interesting, what if we have two columns and some of the cells contain [] to be dropped?
df=pd.DataFrame(dict(a=[1,2,4,[],5],b=[2,[],1,[],6]))
One way is to get the string repr and filter:
df = df[df['a'].map(repr)!='[]']
Output:
a
0 1
1 2
2 4
4 5
For multiple columns, we could apply the above:
out = df[df.apply(lambda c: c.map(repr)).ne('[]').all(axis=1)]
Output:
a b
0 1 2
2 4 1
4 5 6
You can't use equality directly as pandas will try to align a Series and a list, but you can use isin:
df[~df['a'].isin([[]])]
output:
a
0 1
1 2
2 4
4 5
To act on all columns:
df[~df.isin([[]]).any(axis=1)]
output:
a b
0 1 2
2 4 1
4 5 6
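A hedged follow-up sketch: after dropping the [] rows the columns still have object dtype (they previously held lists), so you may want to cast them back to integers:
import pandas as pd

df = pd.DataFrame(dict(a=[1, 2, 4, [], 5], b=[2, [], 1, [], 6]))
# drop rows where any cell is an empty list, then restore integer dtype
out = df[~df.isin([[]]).any(axis=1)].astype(int)
print(out.dtypes)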
I have a csv file like this:
,,22-5-2021 (v_c) , 23-5-2021 (v_c)
col_a,col_b,v_c,v_d,v_c,v_d
1,1,2,4,5,6
2,2,2,3,7,6
3,3,2,5,6,5
I need to convert it to:
col_a,col_b,v_c,v_d,dates
1,1,2,4,22-5-2021
1,1,5,6,23-5-2021
2,2,2,3,22-5-2021
2,2,7,6,23-5-2021
3,3,2,5,22-5-2021
3,3,6,5,23-5-2021
or
col_a,col_b,v_c,v_d,dates
1,1,2,4,22-5-2021
2,2,2,3,22-5-2021
3,3,2,5,22-5-2021
1,1,5,6,23-5-2021
2,2,7,6,23-5-2021
3,3,6,5,23-5-2021
My approach was to use df.melt, but I didn't quite get it to work. I think I'm lost on how to bring in the dates, which apply to two columns each.
You can try a list comprehension plus pd.wide_to_long():
import pandas as pd

# read with the second line as the header, then suffix the first block of
# columns with '.0' so every stub has a numeric suffix for wide_to_long
df = pd.read_csv('etc.csv', header=1)
df.columns = [x if x.split('.')[-1].isnumeric() else x + '.0' for x in df]
df = (pd.wide_to_long(df, ['v_c', 'v_d'], ['col_a.0', 'col_b.0'], 'drop', sep='.')
        .reset_index().sort_values('drop'))
# map the numeric suffix back to the dates from the first header row
df['dates'] = df.pop('drop').map({0: '22-5-2021', 1: '23-5-2021'})
df.columns = df.columns.str.rstrip('.0')
output of df:
col_a col_b v_c v_d dates
0 1 1 2 4 22-5-2021
2 2 2 2 3 22-5-2021
4 3 3 2 5 22-5-2021
1 1 1 5 6 23-5-2021
3 2 2 7 6 23-5-2021
5 3 3 6 5 23-5-2021
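If wide_to_long feels heavy, here is a minimal alternative sketch, assuming the same file ('etc.csv') with the date row first and the real header on the second line; it simply slices the two column blocks and concatenates them with a dates column:
import pandas as pd

df = pd.read_csv('etc.csv', header=1)      # columns: col_a, col_b, v_c, v_d, v_c.1, v_d.1
dates = ['22-5-2021', '23-5-2021']         # taken from the skipped first header row

blocks = []
for i, d in enumerate(dates):
    suffix = '' if i == 0 else f'.{i}'     # pandas renames duplicate headers to v_c.1, v_d.1
    block = df[['col_a', 'col_b', f'v_c{suffix}', f'v_d{suffix}']].copy()
    block.columns = ['col_a', 'col_b', 'v_c', 'v_d']
    block['dates'] = d
    blocks.append(block)

out = pd.concat(blocks, ignore_index=True)
print(out)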
I have a dataset that looks like this:
ID date
1 01-01-2012
1 05-02-2012
1 25-06-2013
1 14-12-2013
1 10-04-2014
2 19-05-2012
2 07-08-2014
2 10-09-2014
2 27-11-2015
2 01-12-2015
3 15-04-2013
3 17-05-2015
3 22-05-2015
3 30-10-2016
3 02-11-2016
I am working with Python and I would like to select the last 3 dates for each ID. Here is the dataset I would like to have:
ID date
1 25-06-2013
1 14-12-2013
1 10-04-2014
2 10-09-2014
2 27-11-2015
2 01-12-2015
3 22-05-2015
3 30-10-2016
3 02-11-2016
I used this code to select the very last date for each ID:
df_2=df.sort_values(by=['date']).drop_duplicates(subset='ID',keep='last')
But how can I select more than one date (for example, the last 3 dates, the last 4 dates, etc.)?
You might use groupby and tail in the following way to get the last 2 items from each group:
import pandas as pd
df = pd.DataFrame({'ID':[1,1,1,2,2,2,3,3,3],'value':['A','B','C','D','E','F','G','H','I']})
df2 = df.groupby('ID').tail(2)
print(df2)
Output:
ID value
1 1 B
2 1 C
4 2 E
5 2 F
7 3 H
8 3 I
Note that for simplicity's sake I used different (already sorted) data to build df.
You can try this:
df.sort_values(by=['date']).groupby('ID').tail(3).sort_values(['ID', 'date'])
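Since the dates in the question are dd-mm-yyyy strings, you may want to convert them to datetimes before sorting; here is a minimal sketch on a small subset of the data above (assuming the real df has the same 'ID' and 'date' columns):
import pandas as pd

# small subset of the data above, with dd-mm-yyyy date strings
df = pd.DataFrame({'ID': [1, 1, 1, 1, 1],
                   'date': ['01-01-2012', '05-02-2012', '25-06-2013',
                            '14-12-2013', '10-04-2014']})
# parse the dates so sorting is chronological rather than lexicographic
df['date'] = pd.to_datetime(df['date'], format='%d-%m-%Y')
last3 = df.sort_values('date').groupby('ID').tail(3).sort_values(['ID', 'date'])
print(last3)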
I tried this, but with a non-datetime data type:
a = [1,1,1,1,1,2,2,2,2,2,3,3,3,3,3]
b = ['a','b','c','d','e','f','g','h','i','j','k','l','m','n','o']
import pandas as pd
import numpy as np
a = np.array([a,b])
df=pd.DataFrame(a.T,columns=['ID','Date'])
# the tail would give you the last n number of elements you are interested in
df_ = df.groupby('ID').tail(3)
df_
output:
ID Date
2 1 c
3 1 d
4 1 e
7 2 h
8 2 i
9 2 j
12 3 m
13 3 n
14 3 o
I want to make a table with all available products for every customer. However, I only have a table with the product-customer combinations that were actually bought. I want to make a new table that also includes the products that were not bought by each customer. The current table looks as follows:
The table I want to end up with is:
Could anyone help me how to do this in pandas?
One way to do this is to use pd.MultiIndex and reindex:
df = pd.DataFrame({'Product':list('ABCDEF'),
'Customer':[1,1,2,3,3,3],
'Amount':[4,5,3,1,1,2]})
indx = pd.MultiIndex.from_product([df['Product'].unique(),
df['Customer'].unique()],
names=['Product','Customer'])
df.set_index(['Product','Customer'])\
.reindex(indx, fill_value=0)\
.reset_index()\
.sort_values(['Customer','Product'])
Output:
Product Customer Amount
0 A 1 4
3 B 1 5
6 C 1 0
9 D 1 0
12 E 1 0
15 F 1 0
1 A 2 0
4 B 2 0
7 C 2 3
10 D 2 0
13 E 2 0
16 F 2 0
2 A 3 0
5 B 3 0
8 C 3 0
11 D 3 1
14 E 3 1
17 F 3 2
You can also create a pivot to do what you want in one line. Note that the output format is different: Product becomes the index and Customer the columns, rather than the long format above. But if you're not especially fussed about that (it depends on how you intend to use the final table), the following code does the job.
df = pd.DataFrame({'Product':['A','B','C','D','E','F'],
'Customer':[1,1,2,3,3,3],
'Amount':[4,5,3,1,1,2]})
pivot_df = df.pivot(index='Product',
columns='Customer',
values='Amount').fillna(0).astype('int')
Output:
Customer 1 2 3
Product
A 4 0 0
B 5 0 0
C 0 3 0
D 0 0 1
E 0 0 1
F 0 0 2
df.pivot creates NaN values when there are no corresponding entries in the original df (it creates a NaN value for Product A and Customer 2, for instance). NaNs are float values, so all the 'Amounts' in the pivot are implicitly converted into floats. This is why I use fillna(0) to convert the NaN values into 0s, and then finally change the dtype back to int.
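If you later need the long format back, here is a hedged follow-up sketch (assuming pivot_df from the block above): stack the pivot and reset the index:
# turn the Product x Customer pivot back into the long format of the first answer
long_df = (pivot_df.stack()
                   .rename('Amount')
                   .reset_index()
                   .sort_values(['Customer', 'Product']))
print(long_df)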
I'm trying to calculate what I am calling "delta values", meaning the amount that has changed between two consecutive rows.
For example
A | delta_A
1 | 0
2 | 1
5 | 3
9 | 4
I managed to do that starting with this code (basically copied from a MATLAB program I had):
df = df.assign(delta_A=np.zeros(len(df.A)))
df['delta_A'][0] = 0 # start at 'no-change'
df['delta_A'][1:] = df.A[1:].values - df.A[:-1].values
This generates the dataframe correctly and seems to have no further negative effects.
However, I think there is something wrong with that approach because I get these messages:
A value is trying to be set on a copy of a slice from a DataFrame
See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
.../__main__.py:5: SettingWithCopyWarning
I didn't really understand what that link was trying to say, so I found this post:
Adding new column to existing DataFrame in Python pandas
The latest edit to the answer there says to use this code, but I have already used that syntax:
df1 = df1.assign(e=pd.Series(np.random.randn(sLength)).values)
So, the question is: is .loc the way to go, or what is the more correct way to get that column?
It seems you need diff and then replace NaN with 0:
df['delta_A'] = df.A.diff().fillna(0).astype(int)
A delta_A
0 0 0
1 4 4
2 7 3
3 8 1
Alternative solution with assign
df = df.assign(delta_A=df.A.diff().fillna(0).astype(int))
A delta_A
0 0 0
1 4 4
2 7 3
3 8 1
Another solution if you need to replace only first NaN value:
df['delta_A'] = df.A.diff()
df.loc[df.index[0], 'delta_A'] = 0
print (df)
A delta_A
0 0 0.0
1 4 4.0
2 7 3.0
3 8 1.0
Your solution can be modified with iloc, but I think it's better to use the diff function:
df['delta_A'] = 0 # convert all values to 0
df['delta_A'].iloc[1:] = df.A[1:].values - df.A[:-1].values
#also works
#df['delta_A'][1:] = df.A[1:].values - df.A[:-1].values
print (df)
A delta_A
0 0 0
1 4 4
2 7 3
3 8 1
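To address the .loc part of the question directly, here is a minimal sketch (diff above is still the simpler option): assigning the whole column in one step with .loc avoids the chained indexing that triggers the SettingWithCopyWarning:
import pandas as pd

df = pd.DataFrame({'A': [0, 4, 7, 8]})
# assign the full column at once instead of writing into a slice of it
df.loc[:, 'delta_A'] = df['A'].diff().fillna(0).astype(int)
print(df)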