pandas - read_csv with missing values in header - python

I have this kind of csv file:
date,a,b,c
2014,12,29,7,12,45
2014,12,30,7,13,12
2014,12,31,6.5,6,5
So the first row does not explicitly specify all columns, and kind of assumes that you understand that the date is the first 3 columns.
How do I tell read_csv to consider the first three columns as one date column (while keeping the other labels)?

You can parse your columns directly as a date, if you use the parse_dates argument.
From the docs:
parse_dates : boolean, list of ints or names, list of lists, or dict, default False
If True -> try parsing the index.
If [1, 2, 3] -> try parsing columns 1, 2, 3 each as a separate date column.
If [[1, 3]] -> combine columns 1 and 3 and parse as a single date column.
{'foo' : [1, 3]} -> parse columns 1, 3 as date and call result 'foo'
A fast-path exists for iso8601-formatted dates.
For your file, you can do something like this:
pd.read_csv(file_path, names=['y', 'm', 'd', 'a', 'b', 'c'], header=0,
            parse_dates={'date': [0, 1, 2]}, index_col='date')
              a   b   c
date
2014-12-29  7.0  12  45
2014-12-30  7.0  13  12
2014-12-31  6.5   6   5
The issue with the missing values in the header is solved by passing the names argument together with header=0 (to overwrite the existing header). After that it is possible to specify which columns should be parsed as a single date.
See another example here.
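For reference, here is a self-contained sketch of the same call, with the sample rows from the question pasted in via io.StringIO (recent pandas versions emit a deprecation warning for the dict form of parse_dates, but the behaviour matches the quoted docs):
import io
import pandas as pd

data = io.StringIO(
    "date,a,b,c\n"
    "2014,12,29,7,12,45\n"
    "2014,12,30,7,13,12\n"
    "2014,12,31,6.5,6,5\n"
)

# names replaces the short header, header=0 drops the original header row,
# and the parse_dates dict combines columns 0-2 into one 'date' column
df = pd.read_csv(data, names=['y', 'm', 'd', 'a', 'b', 'c'], header=0,
                 parse_dates={'date': [0, 1, 2]}, index_col='date')
print(df)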

Related

Pandas: Selecting columns (with regex) and renaming them (with a list)

I am trying to do 2 things in Python:
Select the names of specific columns using a regex
Rename these selected columns using a list of names (the names are unfortunately stored in their own weird dataframe)
I am new to python and pandas but did a bunch of googling and am getting the TypeError: Index does not support mutable operations error. Here's what I am doing.
import pandas as pd
import numpy as np
df = pd.DataFrame(data=np.array([
        [1, 3, 3, 4, 5, 9, 5],
        [1, 2, 4, 4, 5, 8, 4],
        [1, 2, 3, 'a', 5, 7, 3],
        [1, 2, 3, 4, 'e', 6, 2],
        ['f', 2, 3, 4, 5, 6, 1]
    ]),
    columns=['a', 'car-b', 'car-c', 'car-d', 'car-e', 'car-f', 'car-g'])
#Select the NAMES of the columns that contain 'car' in them as I want to change these column names
names_to_change = df.columns[df.columns.str.contains("car")]
names_to_change
#Here is the dataset that has the names that I want to use to replace these
#This is just how the names are stored in the workflow
new_names = pd.DataFrame(data=np.array([
    ['new_1', 'new_3', 'new_5'],
    ['new_2', 'new_4', 'new_6']
]))
new_names
#My approach is to transform the new names into a list
new_names_list=pd.melt(new_names).iloc[:,1].tolist()
new_names_list
#Now I figure I would use .columns to do the replacement
#But this returns the mutability error
df.columns[df.columns.str.contains("car")]=new_names_list
#This also returns the same error
df.columns = df.columns[df.columns.str.contains("car")].tolist()+new_names_list
Traceback (most recent call last):
File "C:\Users\zsg876\AppData\Local\Temp/ipykernel_1340/261138782.py", line 44, in <module>
df.columns[df.columns.str.contains("car")]=new_names_list
File "C:\Users\zsg876\Anaconda3\lib\site-packages\pandas\core\indexes\base.py", line 4585, in __setitem__
raise TypeError("Index does not support mutable operations")
TypeError: Index does not support mutable operations
I tried a bunch of different methods (this was no help: how to rename columns in pandas using a list) and haven't had much luck. I am coming over from R where renaming columns was a lot simpler -- you'd just pass a vector using names().
I take it the workflow is different here? Appreciate any suggestions!
UPDATE:
This seems to do the trick, but I am not sure why exactly. I figured replacing one list with another of equal length would work, but that does not seem to be the case. Can anyone educate me here?
col_rename_dict=dict(zip(names_to_change,new_names_list))
df.rename(columns=col_rename_dict, inplace=True)
You can use df.filter(like='car').columns to get the names of the columns containing car, and new_names.to_numpy().T.ravel() to efficiently flatten the new_names dataframe into an array of the new names. Then use zip and dict to turn the two arrays into a dict whose keys are the old column names and whose values are the new column names. Finally, pass that to df.rename with axis=1:
old_names = df.filter(like='car').columns
new_names = new_names.to_numpy().T.ravel()
df = df.rename(dict(zip(old_names, new_names)), axis=1)
Output:
>>> df
a new_1 new_2 new_3 new_4 new_5 new_6
0 1 3 3 4 5 9 5
1 1 2 4 4 5 8 4
2 1 2 3 a 5 7 3
3 1 2 3 4 e 6 2
4 f 2 3 4 5 6 1
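To answer the "why" in the update: a pandas Index is immutable, so item assignment into df.columns (as in the first attempt) raises that TypeError; you can only replace the whole columns object in one go. A minimal sketch of that alternative, reusing names_to_change and new_names_list from the question:
# Map each old name to its replacement, leave every other column untouched
mapping = dict(zip(names_to_change, new_names_list))
df.columns = [mapping.get(col, col) for col in df.columns]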

How to calculate mean of specific rows in python dataframe?

I have a dataframe with about 11 million rows. There are multiple columns, but I am interested in only 2 of them: TagName and Samples_Value. One tag can repeat itself multiple times across rows. I want to calculate the average value for each tag and create a new dataframe with the average value for each tag. I don't really know how to walk through the rows or how to calculate the average. Any help will be highly appreciated. Thank you!
Name DataType TimeStamp Value Quality
Food Float 2019-01-01 13:00:00 105.75 122
Food Float 2019-01-01 17:30:00 11.8110352 122
Food Float 2019-01-01 17:45:00 12.7932892 122
Water Float 2019-01-01 14:01:00 16446.875 122
Water Float 2019-01-01 14:00:00 146.875 122
RangeIndex: 11140487 entries, 0 to 11140486
Data columns (total 6 columns):
Name object
Value object
This is what I have, and I know it is really noob-ish, but I am having a difficult time walking through the rows.
for i in range(0, len(df)):
    if df.iloc[i]['DataType'] != 'Undefined':
        print(df.loc[df['Name'] == df.iloc[i]['Name'], df.iloc[i]['Value']].mean())
It sounds like the groupby() functionality is what you want. You define the column where your groups are and then you can take the mean() of each group. An example from the documentation:
df = pd.DataFrame({'A': [1, 1, 2, 1, 2],
                   'B': [np.nan, 2, 3, 4, 5],
                   'C': [1, 2, 1, 1, 2]}, columns=['A', 'B', 'C'])
df.groupby('A').mean()
Output:
     B         C
A
1  3.0  1.333333
2  4.0  1.500000
In your case it would be something like this:
df.groupby('TagName')['Samples_Value'].mean()
Edit: I applied the code to your provided input dataframe, and the following is the output:
TagName
Steam        1.081447e+06
Utilities    3.536931e+05
Name: Samples_Value, dtype: float64
Is this what you are looking for?
You don't need to walk through the rows; you can just take all of the fields that match your criteria:
d = {'col1': [1, 2, 1, 2, 1, 2], 'col2': [3, 4, 5, 6, 7, 8]}
df = pd.DataFrame(data=d)

# Iterate over all unique entries in col1
for entry in df["col1"].unique():
    # Get the mean of the col2 values where col1 equals the current entry
    meanofcurrententry = df[df["col1"] == entry]["col2"].mean()
    print(meanofcurrententry)
This is not a full solution, but I think it helps more to understand the necessary logic. You still need to wrap it up into your own dataframe, but it hopefully helps you understand how to use the indexing.
You should avoid iterating over the rows of a dataframe as much as possible, because it is very inefficient...
groupby is the way to go when you want to apply the same processing to various groups of rows identified by their values in one or more columns. Here, what you want is:
df.groupby('TagName')['Samples_Value'].mean().reset_index()
which gives, as expected:
     TagName  Samples_Value
0      Steam   1.081447e+06
1  Utilities   3.536931e+05
Details on the magic words:
groupby: identifies the column(s) used to group the rows (same values)
['Samples_Value']: restricts the groupby object to the column of interest
mean(): computes the mean per group
reset_index(): by default the grouping columns go into the index, which is fine for the mean operation; reset_index() turns them back into normal columns
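For completeness, here is a runnable sketch using the sample columns actually shown in the question (Name / Value, with illustrative values); substitute TagName / Samples_Value if those are the real column names:
import pandas as pd

# Sample data mirroring the rows shown in the question
df = pd.DataFrame({
    'Name': ['Food', 'Food', 'Food', 'Water', 'Water'],
    'Value': [105.75, 11.8110352, 12.7932892, 16446.875, 146.875],
})

# One row per tag with its mean value
means = df.groupby('Name')['Value'].mean().reset_index()
print(means)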

Why can't I set a series type to equal another series type with Python pandas

I'm fairly new to python so forgive me if this seems like a simple question.
I have a dataframe. My goal is to take the values of a dataframe column, convert them to another type, and replace that column. Here is the code:
strtotime = {}
for x in range(0, len(results['CreationDate'])):
    strtotime[x] = datetime.strptime(results['CreationDate'][x], '%Y-%m-%dT%H:%M:%S.%f')
results['CreationDate'] = pd.to_datetime(pd.Series(strtotime))
I stored the values as a dictionary, converted it to a series using pd.Series, at which point I'm fairly certain I can just replace one series with another:
i.e results['CreationDate'] = pd.to_datetime(pd.Series(strtotime))
but what I get in return for results is a column of NaT instead of these neat datetimes 2015-01-01 10:59:37.403.
I then used results['CreationDate'] = list(pd.to_datetime(pd.Series(strtotime)))
which worked perfectly as I wanted it to be. So my question is why is this the case? Does it even have anything to do with object types?
When you assign a Series to a DataFrame column, pandas matches the new values according to the index. Your original DataFrame presumably has some meaningful index, but your new Series just has the default index of 0, 1, 2, 3..., because those are the keys in your dictionary. Here is a simple example:
>>> d = pandas.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]}, index=[10, 11, 12])
>>> d
A B
10 1 4
11 2 5
12 3 6
>>> d["C"] = pandas.Series([8, 88, 888])
>>> d
A B C
10 1 4 NaN
11 2 5 NaN
12 3 6 NaN
>>> d["C"] = pandas.Series([8, 88, 888], index=[10, 11, 12])
>>> d
A B C
10 1 4 8
11 2 5 88
12 3 6 888
Notice that assigning a series with the wrong index resulted in NaN, but creating the new Series with the same index results in the values being put in as expected.
In your case, you are creating your new Series by applying a function to each element of the original column. Don't iterate to do that. Use the .map method. In this case, there is a builtin pandas function to convert a string to a datetime:
results['CreationDate'] = results['CreationDate'].map(pandas.to_datetime)
.map gives a new Series with the same index as the old. (If your dates don't parse correctly, you can apply a lambda that supplies a format argument to to_datetime.)
(As piRsquared noted in a comment, to_datetime actually accepts a Series argument, so you can just do results['CreationDate'] = pandas.to_datetime(results['CreationDate']).)
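If the strings need an explicit format, here is a small sketch of both variants (results and the format string are taken from the question):
import pandas as pd

# Option 1: vectorised, passing the format straight to to_datetime
results['CreationDate'] = pd.to_datetime(
    results['CreationDate'], format='%Y-%m-%dT%H:%M:%S.%f')

# Option 2 (equivalent): element-wise with .map and a lambda
# results['CreationDate'] = results['CreationDate'].map(
#     lambda s: pd.to_datetime(s, format='%Y-%m-%dT%H:%M:%S.%f'))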

How to analyze a dataframe with multiple headers?

For example, I have a df with 3 headers. I want to analyze data from one of the columns in the first header and one of the columns in the second header. How do I do that?
It's hard to know if this will work because you haven't provided your data, but you can try this.
First access the column names
data.columns
Then isolate the corresponding columns you would like to analyze
data = data[['column_1', 'column_2']]
Index the columns using the names that currently appear as the column names; ignore the column names that are not being used and just index on the corresponding match.
You can then rename the columns.
data.columns = ['new_column_1_name', 'new_column_2_name']
You can pull them out as tuples:
In [11]: df = pd.DataFrame([[1, 2], [3, 4]], columns=[["A", "B"], ["a", "b"]])
In [12]: df
Out[12]:
A B
a b
0 1 2
1 3 4
In [13]: df[[("A", "a")]]
Out[13]:
A
a
0 1
1 3
In your case it might be:
df[[("Year", "All ages")]]
See the advanced section of the docs for multi-index indexing and slicing.
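Since the question mentions three header rows, here is a minimal sketch of reading them into a 3-level column MultiIndex and selecting across levels (the file name and the labels are placeholders, not from the question):
import pandas as pd

# Three header rows become a 3-level column MultiIndex
df = pd.read_csv('data.csv', header=[0, 1, 2])

# Select one column by giving the full tuple of labels, one per level
col = df[('Year', 'All ages', 'Total')]

# Or slice on a single level across all columns with xs
sub = df.xs('All ages', axis=1, level=1)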

reindex to add missing dates to pandas dataframe

I am trying to parse a CSV file that looks like this:
dd.mm.yyyy value
01.01.2000 1
02.01.2000 2
01.02.2000 3
I need to add the missing dates and fill the corresponding values with NaN. I used Series.reindex as in this question:
import pandas as pd
ts=pd.read_csv(file, sep=';', parse_dates='True', index_col=0)
idx = pd.date_range('01.01.2000', '02.01.2000')
ts.index = pd.DatetimeIndex(ts.index)
ts = ts.reindex(idx, fill_value='NaN')
But in the result, the values for certain dates are swapped because of the date format (i.e. mm/dd instead of dd/mm):
01.01.2000 1
02.01.2000 3
03.01.2000 NaN
...
...
31.01.2000 NaN
01.02.2000 2
I tried several ways (e.g. adding dayfirst=True to read_csv) to get it right but still can't figure it out. Please help.
Set parse_dates to the first column with parse_dates=[0]:
ts = pd.read_csv(file, sep=';', parse_dates=[0], index_col=0, dayfirst=True)
idx = pd.date_range('01.01.2000', '02.01.2000')
ts.index = pd.DatetimeIndex(ts.index)
ts = ts.reindex(idx, fill_value='NaN')
print(ts)
prints:
value
2000-01-01 1
2000-01-02 2
2000-01-03 NaN
...
2000-01-31 NaN
2000-02-01 3
parse_dates=[0] tells pandas to explicitly parse the first column as dates. From the docs:
parse_dates : boolean, list of ints or names, list of lists, or dict
If True -> try parsing the index.
If [1, 2, 3] -> try parsing columns 1, 2, 3 each as a separate date column.
If [[1, 3]] -> combine columns 1 and 3 and parse as a single date column.
{'foo' : [1, 3]} -> parse columns 1, 3 as date and call result 'foo'
A fast-path exists for iso8601-formatted dates.
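One small refinement to the answer, under the assumption that the value column should stay numeric: reindex already fills missing rows with NaN by default, so passing the string 'NaN' as fill_value is unnecessary (and would turn the column into object dtype). Unambiguous ISO date strings for date_range also avoid any day/month guessing:
import pandas as pd

ts = pd.read_csv(file, sep=';', parse_dates=[0], index_col=0, dayfirst=True)
idx = pd.date_range('2000-01-01', '2000-02-01')
ts = ts.reindex(idx)   # missing dates are filled with NaN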
