I'm not sure I'm going to describe this right, but I'll try.
I have several Excel files with about 20 columns and 10k or so rows. Let's say the column names are of the form Col1, Col2, ..., Col20.
Col2 is a timestamp column, so, for instance, a value could read: "2012-07-25 14:21:00".
I want to read the excel files into a DataFrame and perform some time series and grouping operations.
Here's some simplified code to load an excel file:
xl = pd.ExcelFile(os.path.join(dirname, filename))
df = xl.parse(xl.sheet_names[0], index_col=1) # Col2 above
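(An aside: in current pandas the same load can likely be done with a single read_excel call; a minimal sketch, reusing dirname and filename from the snippet above:)
import os
import pandas as pd

# Read the first sheet, use the second column (Col2) as the index, and parse it as datetimes
df = pd.read_excel(os.path.join(dirname, filename), sheet_name=0, index_col=1, parse_dates=True)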
When I run
df.index
it gives me:
<class 'pandas.tseries.index.DatetimeIndex'>
[2012-01-19 15:37:55, ..., 2012-02-02 16:13:42]
Length: 9977, Freq: None, Timezone: None
as expected. However, inspecting the columns, I get:
Index([u'Col1', u'Col2',...u'Col20'], dtype='object')
That may be why I have problems with some of the manipulations I want to do. For instance, when I run:
df.groupby(category_col).count()
I expect to get a dataframe with 1 row for each category and 1 column containing the count for that category. Instead, I get a dataframe with 1 row for each category and 19 columns describing the number of values for that column/category pair.
The same thing happens when I try to resample:
df.resample('D', how='count')
Instead of a single column Dataframe with the number of records per day, I get:
2012-01-01 Col1 8
Col2 8
Coln 8
2012-01-02 Col1 10
Col2 10
Coln 10
Is this normal behavior? How would I instead get just one value per day, category, whichever?
Based on this blog post from Wes McKinney, I think the problem is that I have to run my operations on a specific column, namely a column that I know won't have missing data.
So instead of doing:
df.groupby(category_col).count()
I should do:
df['col3'].groupby(df[category_col]).count()
and this:
df2.resample('D', how='count')
should be this:
df2['col3'].resample('D', how='count')
The results are more in line with what I'm looking for. The groupby now gives:
Category
Cat1    1232
Cat2    7677
Cat3    1053
and the resample gives:
Date
2012-01-01     8
2012-01-02    66
2012-01-03    89
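(A brief editorial aside, not from the original answer: in recent pandas, size() returns a single count per group without having to pick a non-null column, and resample's how= argument has been removed in favour of method chaining. A minimal sketch, assuming df and category_col as above:)
# One count per category, regardless of missing values in any column
df.groupby(category_col).size()

# One count of records per day (the modern spelling of how='count' on a resample)
df.resample('D').size()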
I am trying to merge the following two dataframes on the date, price and code columns, but there is an unexpected loss of rows:
df1.head(5)
date price code
0 2015-01-03 27.65 534
1 2015-03-30 43.65 231
2 2015-01-09 24.65 132
3 2015-11-13 87.13 211
4 2015-05-12 63.25 231
df2.head(5)
date price code volume
0 2015-01-03 27.65 534 43
1 2015-03-30 43.65 231 21
2 2015-01-09 24.65 132 88
3 2015-12-25 0.00 211 130
4 2015-03-12 11.15 211 99
df1 is a subset of df2, but without the volume column.
I tried the following code, but half of the rows disappear:
master = pd.merge(df1, df2, on=['date', 'price', 'code'])
I know I can do a left join and keep all the rows, but they will be NaN for volume, which is the entire purpose of doing the join.
I'm not sure why rows go missing after the join, since df2 contains every date in df1, with the same prices and codes. The columns also have the same data types; the dates are strings.
df1.dtypes
date object
price float64
code int
dtype: object
df2.dtypes
date object
price float64
code int
volume int
dtype: object
When I check if a certain value exists in a dataframe, it returns False, which could be a clue as to why they don't join as expected (unless I'm checking wrong):
df2['price'][0]
> 27.65
27.65 in df2['price']
> False
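(A side note on that check: in pandas, the in operator on a Series tests membership against the index labels, not the values, so False is expected here and is not evidence of an encoding problem. Value-based checks would be:)
(df2['price'] == 27.65).any()    # True
27.65 in df2['price'].values     # True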
I'm at a loss about what to check next - any ideas? I think there could be some sort of encoding issues or something along those lines.
I also manually checked the dataframes for the rows which weren't able to be joined and the values definitely exist.
Edit: I know it's expected for the rows to be dropped when they do not match on the columns specified in the join. The problem is that rows are dropped even when they do match (or they appear to be).
You say "df1 is a subset of df2, but without the volume column". I take it you don't mean that the volume column is their only difference; it obviously isn't, otherwise the question would be trivial (you could just add the volume column).
You just need to add the how='outer' keyword argument.
df = pd.merge(df1, df2, on=['date', 'code', 'price'], how='outer')
With your dataframes, since the column names are consistent, you can simply do:
master = pd.merge(df1, df2, how='left')
By default, pd.merge() joins on all common columns and performs an inner join; use the how argument to request a left join instead.
The values disappear because rows 3 and 4 have different values in the price column (and row 4 also differs in the code column). merge's default behavior is an inner join, so those rows are dropped.
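(A hedged diagnostic sketch, not part of the answers above: merge's indicator flag shows exactly which rows fail to find a match, which often exposes small float differences in a column like price:)
# Left join with indicator=True to see which df1 rows found no partner in df2
check = pd.merge(df1, df2, on=['date', 'price', 'code'], how='left', indicator=True)
unmatched = check[check['_merge'] == 'left_only']

# If price is the culprit, rounding both sides before merging is one way to test that theory
master = pd.merge(df1.assign(price=df1['price'].round(2)),
                  df2.assign(price=df2['price'].round(2)),
                  on=['date', 'price', 'code'])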
I have a problem with my pandas dataframe. Pandas detects duplicated rows, but there are none.
I wanted to use the pivot function but I have the error message “ValueError: Index contains duplicate entries, cannot reshape”.
So I tried to find the duplicated rows in my dataframe, and when I used the duplicated() function the result was:
Number_id [...] Name Value
802 001 [...] Name1 41
809 001 [...] Name2 75
813 001 [...] Name3 13
845 001 [...] Name4 2
Obviously, those rows are not the same: for each row, the Number_id, Value and Name are different.
My dataframe dimensions are [860 rows x 10 columns]. There are 215 Number_ids, and each Number_id has 4 values, one for each Name (215 * 4 = 860).
I wanted to use the pivot function like this :
df.pivot(index=list_of_index_columns, columns='Name', values='Value')
list_of_index_columns corresponds to all the columns of the df except Name and Value, so 8 columns.
I don't know how to handle this. Can I have some help?
I use the Spyder 3.8 version.
There is duplication in your data. For example:
import pandas as pd

df = pd.DataFrame([
    ['0', '0', 'C', '0', 'E', '0'],
    ['A', '0', '0', '0', '0', 'F'],
    ['A', '0', '0', '0', '0', 'F'],
    ['0', '0', 'C', 'D', '0', '0'],
    ['A', 'B', '0', '0', '0', '0']], columns=['A', 'B', 'C', 'D', 'E', 'F'])

df[df.duplicated()]            # shows the repeated occurrence of a duplicated row
df[df.duplicated(keep=False)]  # shows every occurrence of a duplicated row
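(A hedged follow-up sketch, assuming list_of_index_columns from the question: pivot only requires the combination of the index columns plus Name to be unique, so duplicated() on whole rows can miss the offending entries. Checking that combination directly, or aggregating with pivot_table, may be closer to what is needed:)
# Rows that collide on the keys the pivot actually uses
keys = list_of_index_columns + ['Name']
collisions = df[df.duplicated(subset=keys, keep=False)]

# If the repeats are legitimate, pivot_table with an aggregation avoids the ValueError
out = df.pivot_table(index=list_of_index_columns, columns='Name', values='Value', aggfunc='first')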
I have a CSV that initially creates following dataframe:
Date Portfoliovalue
0 2021-05-01 50000.0
1 2021-05-05 52304.0
Using the following script, I would like to fill in the missing dates and have a corresponding NaN in the Portfoliovalue column. So the result would be this:
Date Portfoliovalue
0 2021-05-01 50000.0
1 2021-05-02 NaN
2 2021-05-03 NaN
3 2021-05-04 NaN
4 2021-05-05 52304.0
I first tried the method here: Fill the missing date values in a Pandas Dataframe column
However, the bfill replaces all my NaNs, and removing it only returns an error.
So far I have tried this:
from datetime import datetime
import pandas as pd

df = pd.read_csv("Tickers_test5.csv")
df2 = pd.read_csv("Portfoliovalues.csv")
portfolio_value = df['Currentvalue'].sum()
portfolio_value = portfolio_value + cash  # cash is defined elsewhere in the programme
date = datetime.date(datetime.now())
df2.loc[len(df2)] = [date, portfolio_value]
print(df2.asfreq('D'))
However, this only returns this:
Date Portfoliovalue
1970-01-01 NaN NaN
Thanks for your help. I am really impressed at how helpful this community is.
Quick update:
I have added the code so that it fills in my missing dates. However, it is part of a programme that tries to update the missing dates every time it launches. So when I execute the code and no dates are missing, I get the following error:
ValueError: cannot reindex from a duplicate axis
The code is as follows:
df2 = pd.read_csv("Portfoliovalues.csv")
portfolio_value = df['Currentvalue'].sum()
date = datetime.date(datetime.now())
df2.loc[date, 'Portfoliovalue'] = portfolio_value
#Solution provided by Uts after asking on Stackoverflow
df2.Date = pd.to_datetime(df2.Date)
df2 = df2.set_index('Date').asfreq('D').reset_index()
So by the looks of it the code adds a duplicate date, which then causes the .reindex() function to raise the ValueError. However, I am not sure how to proceed. Is there an alternative to .reindex() or maybe the assignment of today's date needs changing?
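(An editorial aside on the re-run case, not taken from the answers below: one option is to de-duplicate on Date before calling asfreq, keeping the most recent row for any date that was appended twice. A minimal sketch under that assumption:)
df2.Date = pd.to_datetime(df2.Date)
# Keep only the last row per date, so re-running on the same day does not
# leave a duplicate index entry for asfreq()/reindex() to choke on
df2 = (df2.drop_duplicates(subset='Date', keep='last')
          .set_index('Date').asfreq('D').reset_index())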
Pandas has an asfreq function for a DatetimeIndex; it is basically just a thin but convenient wrapper around reindex() that generates a date_range and calls reindex.
Code
df.Date = pd.to_datetime(df.Date)
df = df.set_index('Date').asfreq('D').reset_index()
Output
Date Portfoliovalue
0 2021-05-01 50000.0
1 2021-05-02 NaN
2 2021-05-03 NaN
3 2021-05-04 NaN
4 2021-05-05 52304.0
Pandas has a reindex method: given a list of index labels, it keeps only those labels, dropping rows whose labels are not in the list and inserting NaN rows for labels that were not already present.
In your case, you can create all the dates you want, for example with date_range, and pass that to reindex. You might need a simple set_index and reset_index around it, but I assume you don't care much about the original index.
Example:
df.set_index('Date').reindex(pd.date_range(start=df['Date'].min(), end=df['Date'].max(), freq='D')).reset_index()
First we set the 'Date' column as the index. Then we reindex with a full list of dates (generated by date_range from the minimum to the maximum date in the 'Date' column, with daily frequency) as the new index. This produces NaNs wherever there was no previous value.
Dataframe:
col1 col2
0 A 2017-02-04 10:41:00.0000000
1 B 2017-02-04 10:41:00.0000000
2 C 2017-02-04 10:41:00.0000000
3 D 2017-02-04 10:41:00.0000000
4 E 2017-02-03 06:13:00.0000000
I have multiple PySpark dataframes, all with string as the datatype of every column. I want to write code that automatically detects which columns hold values in a datetime-like format; for the dataframe above, it should return col2.
I tried the following in Python (pandas), where it worked, but it gives a TypeError in PySpark:
dt_list = []
for x in df.columns:
    if df[x].astype(str).str.match(r'(\d{2,4})-(\d{1,2})-(\d{1,2})\s(\d{2}):(\d{2}):(\d{2})\.(\d+)').all():
        dt_list.append(x)
When a string is converted to a timestamp and it doesn't match a timestamp format, null is returned. So in your code you can check whether a value parses as a timestamp and, depending on that, either leave the column as a string or convert it to a timestamp.
df = spark.sql("select if(isNull(to_timestamp('I am normal string')), 'not date', 'date') as timestamp")
display(df)
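(A hedged PySpark sketch of that idea, assuming the dataframe above; an explicit format argument to to_timestamp may be needed if the fractional seconds are not parsed by default:)
from pyspark.sql import functions as F

def datetime_like_columns(df):
    # Return names of columns whose non-null values all parse as timestamps;
    # to_timestamp yields null for values it cannot parse.
    dt_cols = []
    for c in df.columns:
        non_null = df.filter(F.col(c).isNotNull()).count()
        unparsed = df.filter(F.col(c).isNotNull() & F.to_timestamp(F.col(c)).isNull()).count()
        if non_null > 0 and unparsed == 0:
            dt_cols.append(c)
    return dt_cols

# datetime_like_columns(df)  # expected to return ['col2'] for the sample above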
I have the following csv file that I converted to a DataFrame:
apartment,floor,gasbill,internetbill,powerbill
401,4,120,nan,340
409,4,190,50,140
410,4,155,45,180
I want to iterate over each column, and if the value of a cell in the internetbill column is not a number, delete that whole row. So in this example, the "401,4,120,nan,340" row would be eliminated from the DataFrame.
I thought something like this would work, but to no avail, and I'm stuck:
df.drop[df['internetbill'] == "nan"]
If you are using pd.read_csv, then that nan will be imported as np.nan. If so, you need dropna:
df.dropna(subset=['internetbill'])
apartment floor gasbill internetbill powerbill
1 409 4 190 50.0 140
2 410 4 155 45.0 180
If those are strings for whatever reason, you could do one of two things:
replace
df.replace({'internetbill': {'nan': np.nan}}).dropna(subset=['internetbill'])
to_numeric
df.assign(
    internetbill=pd.to_numeric(df['internetbill'], errors='coerce')
).dropna(subset=['internetbill'])
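(A small end-to-end sketch of the to_numeric route, building the CSV from the question in memory just for illustration:)
import io
import pandas as pd

csv_text = """apartment,floor,gasbill,internetbill,powerbill
401,4,120,nan,340
409,4,190,50,140
410,4,155,45,180"""

df = pd.read_csv(io.StringIO(csv_text))
cleaned = df.assign(
    internetbill=pd.to_numeric(df['internetbill'], errors='coerce')
).dropna(subset=['internetbill'])
print(cleaned)  # only apartments 409 and 410 remain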