Hoping someone can help me here; I believe I am close to the solution.
I have a dataframe on which I am calling .count() to return a series of all the column names of my dataframe and their respective non-NaN value counts.
Example dataframe:
feature_1  feature_2
1          1
2          NaN
3          2
4          NaN
5          3
Example result for .count() here would output a series that looks like:
feature_1 5
feature_2 3
I am now trying to get this data into a dataframe with the column names "Feature" and "Count". The expected output looks like this:
Feature    Count
feature_1  5
feature_2  3
I am using .to_frame() to push the series to a dataframe in order to add column names. Full code:
df = data.count()
df = df.to_frame()
df.columns = ['Feature', 'Count']
However, I receive this error message: "ValueError: Length mismatch: Expected axis has 1 elements, new values have 2 elements", as though it is not recognising the feature names as a column with values.
How can I get it to recognise both the Feature and Count columns so that I can assign names to them?
Use Series.reset_index instead of Series.to_frame to get a two-column DataFrame: the first column comes from the index, the second from the values of the Series:
df = data.count().reset_index()
df.columns = ['Feature', 'Count']
print (df)
     Feature  Count
0  feature_1      5
1  feature_2      3
Another solution uses the name parameter of reset_index together with Series.rename_axis, or alternatively DataFrame.set_axis:
df = data.count().rename_axis('Feature').reset_index(name='Count')
#alternative
df = data.count().reset_index().set_axis(['Feature', 'Count'], axis=1)
print (df)
     Feature  Count
0  feature_1      5
1  feature_2      3
This happens because your new dataframe has only one column: the column names become the index of the series, which to_frame() then carries over as the index of the dataframe. To assign a two-element list to df.columns, you have to reset the index first:
df = data.count()
df = df.to_frame().reset_index()
df.columns = ['Feature', 'Count']
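To see the difference between the two calls side by side, here is a minimal sketch. It assumes `data` is the small example frame from the question, rebuilt by hand; the names `one_col` and `two_col` are only for illustration:

```python
import numpy as np
import pandas as pd

# Assumption: `data` mirrors the example dataframe from the question.
data = pd.DataFrame({
    "feature_1": [1, 2, 3, 4, 5],
    "feature_2": [1, np.nan, 2, np.nan, 3],
})

counts = data.count()            # Series indexed by column name

one_col = counts.to_frame()      # index stays the index, so only 1 column
two_col = counts.reset_index()   # index becomes a regular column, so 2 columns
two_col.columns = ["Feature", "Count"]

print(one_col.shape)  # (2, 1)
print(two_col.shape)  # (2, 2)
```

Assigning two names only works on `two_col`, because `one_col` still keeps the feature names in its index rather than in a column.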
I have a DataFrame df with 50 columns and 28800 rows. I want to add a new column col_new that has the value 0 in every row from 2880 to 5760, 12960 to 15840 and 23040 to 25920, and the value 1 in all other rows.
How could I do that?
I believe what you are looking for is actually answered here: Add column in dataframe from list
my_list = [1,2,3,4,5]  # needs exactly one value per row of df
print(len(df.columns)) # 50
df['new_col'] = my_list
print(len(df.columns)) # 51
Alternatively, you could set the value for slices of rows like so (note that .loc slices include both endpoints):
data['new_col'] = 1
data.loc[2880:5760, 'new_col'] = 0
data.loc[12960:15840, 'new_col'] = 0
data.loc[23040:25920, 'new_col'] = 0
df = pd.DataFrame([i for i in range(28800)])
df["new_col"] = 1
zeros_bool = [(
(i>=(2880-1) and i<5760) or (i>=(12960-1) and i<15840) or (i>=(23040-1) and i<25920)
) for i in range(28800)]
df.loc[zeros_bool,"new_col"] = 0
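If the ranges are meant as row positions with both ends included (an assumption; the question does not say whether the endpoints count), the same column can also be built from a boolean mask without looping over every row in Python. A rough sketch, where the `ranges` list and the placeholder column `x` are made up for illustration:

```python
import numpy as np
import pandas as pd

# Placeholder frame with 28800 rows; the real df would have 50 columns.
df = pd.DataFrame({"x": range(28800)})

# Assumption: each (start, stop) pair is inclusive on both ends.
ranges = [(2880, 5760), (12960, 15840), (23040, 25920)]

pos = np.arange(len(df))              # row positions 0..28799
mask = np.zeros(len(df), dtype=bool)  # True where col_new should be 0
for start, stop in ranges:
    mask |= (pos >= start) & (pos <= stop)

df["col_new"] = np.where(mask, 0, 1)
```

Changing `<=` to `<` in the loop gives half-open ranges if the upper endpoints should stay at 1.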
At the beginning, I'd like to add a multilevel column to an empty dataframe.
df = pd.DataFrame({"nodes": list(range(1, 5, 2))})
df.set_index("nodes", inplace=True)
So this is the dataframe to start with (still empty):
>>> df
nodes
1
3
Now I'd like to add a first multilevel column.
I tried the following:
new_df = pd.DataFrame.from_dict(dict(zip(df.index, [1,2])), orient="index",
columns=["value"])
df = pd.concat([new_df], axis=1, keys=["test"])
Now the dataframe df looks like this:
>>> df
   test
  value
1     1
3     2
To add another column, I've done something similar.
new_df2 = pd.DataFrame.from_dict(dict(zip(df.index, [3,4])), orient="index",
columns=[("test2", "value2")])
df = pd.concat([df, new_df2], axis=1)
df.index.name = "nodes"
So the desired dataframe looks like this:
>>> df
      test  test2
nodes value value2
1         1      3
3         2      4
This way of adding multilevel columns seems a bit strange. Is there a better way of doing so?
Create a MultiIndex on the columns by storing your DataFrames in a dict and then concatenating along axis=1. The keys of the dict become levels of the column MultiIndex (tuple keys add as many levels as the tuple has elements, scalar keys add a single level) and the DataFrame columns stay as they are. Alignment is enforced on the row Index.
import pandas as pd
d = {}
d[('foo', 'bar')] = pd.DataFrame({'val': [1,2,3]}).rename_axis(index='nodes')
d[('foo2', 'bar2')] = pd.DataFrame({'val2': [4,5,6]}).rename_axis(index='nodes')
d[('foo2', 'bar1')] = pd.DataFrame({'val2': [7,8,9]}).rename_axis(index='nodes')
pd.concat(d, axis=1)
      foo foo2
      bar bar2 bar1
      val val2 val2
nodes
0       1    4    7
1       2    5    8
2       3    6    9
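Applied to the frame from the question, the same idea might look like the sketch below. It assumes scalar dict keys are enough here (one extra level on top of each frame's own column name); the names `nodes` and `parts` are illustrative only:

```python
import pandas as pd

# Shared row index, named as in the question.
nodes = pd.Index([1, 3], name="nodes")

# Each dict key becomes the top column level; each frame's own column
# name becomes the second level.
parts = {
    "test": pd.DataFrame({"value": [1, 2]}, index=nodes),
    "test2": pd.DataFrame({"value2": [3, 4]}, index=nodes),
}

df = pd.concat(parts, axis=1)
print(df)
```

Further columns can be added by putting another frame into `parts` and concatenating again, so all pieces go through the same code path.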
180762508,1268510763,374723980,293,20180402035748,198,25,1,1
180762508,1268503685,374717256,307,20180402035758,225,38,1,1
180762508,1268492506,374708540,236,20180402035808,222,52,1,1
180762508,1268485868,374697563,248,20180402035818,197,47,1,1
180762508,1268482430,374688520,272,20180402035828,196,31,1,1
180707764,1270608366,374988433,246,20180402035925,66,37,1,0
180707764,1270620899,374992366,222,20180402035935,68,49,1,0
The first column is a unique id and the last column is the one I am interested in.
I want to know how I can find where the last column changes from 0 to 1.
I built a really big data frame from this dataset in pandas:
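import glob
import os
import pandas as pd
path = r"1\1"
# collect every .DAT file in the directory
all_files = glob.glob(os.path.join(path, "*.DAT"))
frames = []  # avoid shadowing the built-in name list
for filename in all_files:
    df = pd.read_csv(filename, header=None)
    frames.append(df)
a = pd.concat(frames)
a.head()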
This is all I have done so far.
I don't get an error, but I want to know an algorithm for finding where the last column's value changes within each unique id.
My goal is to build a data frame in which the first column is the unique id, the second and third columns are the latitude and longitude (the third and second columns of my dataset), followed by the timestamp (the 5th column of my dataset) at which the last column's value changes from 0 to 1.
If I understood you correctly, you need the 5th row, where the change from 0 to 1 in the last column takes place.
I made a dataframe with your first and last columns (by the way, you said the 1st column is some kind of unique id, but I see repeated numbers). Based on your sample data, one possible solution is:
import pandas as pd
data = [[180762508,1],[180762508,1],[180762508,1],[180762508,1],[180762508,1],[180707764,0],[180707764,0]]
df = pd.DataFrame(data,columns=['my_id','interest'])
#new dataframe to compare the column interest
df2 = df.loc[df['interest'] != df['interest'].shift(-1)]
#output:
# my_id interest
# 4 180762508 1
# 6 180707764 0
imax = df2.index.max() #index after the change
imin = df2.index.min() #index before the change
for i in range(imin,imax,1):
    i
#the row with the change in the original dataframe
print(df.loc[i])
Hi and thanks for posting. It looks like the first column doesn't have unique values, so I'm guessing you want the index or the timestamp returned?
In any case, here's a sample of what might work for you if you want to find when the interest column for an ID changes from 0 to 1:
import pandas as pd
# Provided data
raw_str = """
180762508,1268510763,374723980,293,20180402035748,198,25,1,1 180762508,1268503685,374717256,307,20180402035758,225,38,1,1 180762508,1268492506,374708540,236,20180402035808,222,52,1,1 180762508,1268485868,374697563,248,20180402035818,197,47,1,1 180762508,1268482430,374688520,272,20180402035828,196,31,1,1 180707764,1270608366,374988433,246,20180402035925,66,37,1,0 180707764,1270620899,374992366,222,20180402035935,68,49,1,0
"""
# Replace newline and split on single whitespace
chunks = raw_str.replace('\n', '').split(' ')
# Create simple dictionary for ID, timestamp, and interest columns
ddict = {}
ddict['id'] = [i.split(',')[0] for i in chunks]
ddict['timestamp'] = [i.split(',')[4] for i in chunks]
ddict['interest'] = [i.split(',')[-1] for i in chunks]
# Convert dictionary to pandas DataFrame
df = pd.DataFrame(ddict)
# Create dictionary for sample data
# This is an existing ID with timestamp in the future and 1 as interest
tdict = {
'id': '180707764',
'timestamp': '20180402035945',
'interest': '1',
}
What df looks like once that sample row has been appended and the rows sorted (the next step):
id timestamp interest
0 180707764 20180402035925 0
1 180707764 20180402035935 0
2 180707764 20180402035945 1
3 180762508 20180402035748 1
4 180762508 20180402035758 1
5 180762508 20180402035808 1
6 180762508 20180402035818 1
7 180762508 20180402035828 1
Continuing on:
# Append that dictionary to your dataframe and sort by id, timestamp
# (DataFrame.append was removed in pandas 2.0, so use pd.concat instead)
df = pd.concat([df, pd.DataFrame([tdict])], ignore_index=True)
df = df.sort_values(['id', 'timestamp']).reset_index(drop=True)
# Shift dataframe back 1 period by rows
df2 = pd.DataFrame(df.shift(periods=-1, axis=0))
# Merge that dataframe with our original dataframe by index values
# We're dropping an extra id column and renaming our primary id column for aesthetics
df3 = df.merge(df2, left_index=True, right_index=True, suffixes=('_prev', '_curr'))
df3 = df3.drop('id_curr', axis=1).rename(columns={'id_prev': 'id'})
What df3 looks like:
id timestamp_prev interest_prev timestamp_curr interest_curr
0 180707764 20180402035925 0 20180402035935 0
1 180707764 20180402035935 0 20180402035945 1
2 180707764 20180402035945 1 20180402035748 1
3 180762508 20180402035748 1 20180402035758 1
4 180762508 20180402035758 1 20180402035808 1
5 180762508 20180402035808 1 20180402035818 1
6 180762508 20180402035818 1 20180402035828 1
7 180762508 20180402035828 1 NaN NaN
Now we can just create a conditional to return the row where interest changed from 0 to 1:
In[0]: df3[(df3['interest_prev'] == '0') & (df3['interest_curr'] == '1')]
Which returns:
id timestamp_prev interest_prev timestamp_curr interest_curr
1 180707764 20180402035935 0 20180402035945 1
You can also return specific columns by adding those onto the end of the result set:
df3[(df3['interest_prev'] == '0') & (df3['interest_curr'] == '1')]['timestamp_curr']
df3[(df3['interest_prev'] == '0') & (df3['interest_curr'] == '1')][['id', 'timestamp_curr']]
Or use the original dataframe (df) and .iloc to get specified data:
df.iloc[df3[(df3['interest_prev'] == '0') & (df3['interest_curr'] == '1')].index, :]
Out:
id timestamp interest
1 180707764 20180402035935 0
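One thing to watch with a plain shift is that it also compares the last row of one id with the first row of the next. A possible way around that is to shift within each id using groupby. The sketch below is only an illustration: the column names (`id`, `lon`, `lat`, `timestamp`, `interest`) are assumed, and the last data row is hypothetical, added so that a 0 to 1 change actually occurs:

```python
import io
import pandas as pd

# Three rows from the question plus one hypothetical row (the last one).
raw = """\
180762508,1268482430,374688520,272,20180402035828,196,31,1,1
180707764,1270608366,374988433,246,20180402035925,66,37,1,0
180707764,1270620899,374992366,222,20180402035935,68,49,1,0
180707764,1270630000,374995000,222,20180402035945,68,49,1,1
"""
df = pd.read_csv(io.StringIO(raw), header=None)

# Assumed names: the question says longitude is the 2nd column,
# latitude the 3rd, the timestamp the 5th and the flag the last.
df = df.rename(columns={0: "id", 1: "lon", 2: "lat", 4: "timestamp", 8: "interest"})
df = df.sort_values(["id", "timestamp"])

# Previous value of the flag within the same id (NaN for each id's first row).
prev = df.groupby("id")["interest"].shift(1)

# Keep the rows where the flag goes from 0 to 1 inside one id.
changed = df[(prev == 0) & (df["interest"] == 1)]
result = changed[["id", "lat", "lon", "timestamp"]]
print(result)
```

Because the first row of every id has no previous value, an id that starts at 1 is not reported as a change.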
I was able to produce a pandas dataframe with identical column names.
Is this normal for a pandas dataframe?
How can I select only one of the two columns?
When I use the identical name, are both columns of the dataframe returned as output?
Example given below:
# Producing a new empty pd dataset
dataset=pd.DataFrame()
# fill in a list with values to be added to the dataset later
cases=[1]*10
# Adding the list of values in the dataset, and naming the variable / column
dataset["id"]=cases
# making a list of columns as it is displayed below:
data_columns = ["id", "id"]
# Then, we call the pd dataframe using the defined column names:
dataset_new=dataset[data_columns]
# dataset_new
# It has as a result two columns with identical names.
# How can I process only one of the two dataset columns?
id id
0 1 1
1 1 1
2 1 1
3 1 1
4 1 1
5 1 1
6 1 1
7 1 1
You can use the .iloc to access either column.
dataset_new.iloc[:,0]
or
dataset_new.iloc[:,1]
and of course you can rename your columns, just like you did when you set them both to 'id', using:
dataset_new.columns = ['id_1', 'id_2']
df = pd.DataFrame()
lst = ['1', '2', '3']
df[0] = lst
df[1] = lst
df.rename(columns={0:'id'}, inplace=True)
df.rename(columns={1:'id'}, inplace=True)
# both columns are now named 'id', so select the second one by position
print(df.iloc[:, [1]])
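To make the behaviour with duplicate names concrete, here is a small sketch built on the frame from the question. Selecting by name returns both columns, selecting by position returns one, and `Index.duplicated` can be used to keep only the first column of each name (the variable names `by_name`, `first` and `deduped` are just for illustration):

```python
import pandas as pd

dataset = pd.DataFrame({"id": [1] * 10})
dataset_new = dataset[["id", "id"]]   # two columns, both named "id"

# Selecting by label returns every column with that label.
by_name = dataset_new["id"]
print(type(by_name).__name__)         # DataFrame

# Selecting by position returns exactly one column.
first = dataset_new.iloc[:, 0]

# Keep only the first occurrence of each column name.
deduped = dataset_new.loc[:, ~dataset_new.columns.duplicated()]
print(deduped.shape)                  # (10, 1)
```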