For example, I have the dataframe below, with multiple columns and rows, in which the last column only has data for some of the rows. How can I take that last column and write it to a new dataframe while removing the empty cells that would remain if I just copied the entire column?
Part Number Count Miles
2345125 14 543
5432545 12
6543654 6 112
6754356 22
5643545 6
7657656 8 23
7654567 11 231
3455434 34 112
The data frame I want to obtain would be below
Miles
543
112
23
231
112
I've tried converting the empty cells to NaN and then removing them, but I always either get a KeyError or fail to remove the rows I want. Thanks for any help.
# copy the column
series = df['Miles']
# drop nan values
series = series.dropna()
# one-liner
series = df['Miles'].dropna()
Do you mean:
df.loc[df.Miles.notna(), 'Miles']
Or if you want to drop the rows:
df = df[df.Miles.notna()]
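As a quick end-to-end check, here is a minimal sketch that rebuilds the sample data from the question (np.nan stands in for the blank Miles cells) and extracts only the populated values:
import pandas as pd
import numpy as np
# rebuild the sample frame from the question; np.nan marks the blank Miles cells
df = pd.DataFrame({
    'Part Number': [2345125, 5432545, 6543654, 6754356, 5643545, 7657656, 7654567, 3455434],
    'Count': [14, 12, 6, 22, 6, 8, 11, 34],
    'Miles': [543, np.nan, 112, np.nan, np.nan, 23, 231, 112],
})
# keep only the rows where Miles is populated
miles = df['Miles'].dropna()
print(miles.to_frame('Miles'))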
TLDR: How to load .txt data without a delimiter into a DataFrame where each value array has a different length and is date dependent.
I've got a fairly big data set saved in a .txt file with no delimiter in the following format:
id DateTime 4 84 464 8 64 874 5 854 652 1854 51 84 521 [. . .] 98 id DateTime 45 5 5 456 46 4 86 45 6 48 6 42 84 5 42 84 32 8 6 486 4 253 8 [. . .]
id and DateTime are numbers as well, but I've written them as strings here for readability.
The length between the first id DateTime combination and the next is variable and not all values start/end on the same date.
Right now what I do is use .read_csv with delimiter=" ", which results in a three-column DataFrame with id, DateTime and Values all stacked upon each other:
id DateTime Value
10 01.01 78
10 02.01 781
10 03.01 45
[:]
220 05.03 47
220 06.03 8
220 07.03 12
[:]
Then I create a dictionary entry for each id with the respective DateTime and Values, using dict[id] = df["Value"][df["id"]==id], resulting in a dictionary keyed by id.
Sadly, using .from_dict() doesn't work here because each value list is of a different length. To get around this, I create an np.zeros() array that is bigger than the biggest of the value arrays from the dictionary and save the values for each id inside a new np.array based on their DateTime. Those new arrays are then combined into a new data frame, resulting in a lot of rows populated with zeros.
Desired output is:
A DataFrame with each column representing an id and its values.
First column as the overall timeframe of the data set, basically min(DateTime) to max(DateTime).
Rows in a column where no values exist should be NaN
This seems like a lot of hassle for something that is structurally so simple (see the original format). Besides that, it's quite slow. There must be a way to save the data inside a DataFrame based upon the DateTime, leaving unpopulated areas as NaN.
What is a more optimal solution (if possible) for my issue?
From what I understand, this should work:
for id in df.id.unique():
    df[str(id)] = df['Value'].where(df['id'] == id)
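If the goal is one column per id aligned on DateTime, a pivot is another way to get there. A minimal sketch, assuming the stacked frame has the columns id, DateTime and Value exactly as shown above:
import pandas as pd
# stacked frame as produced by read_csv (values taken from the sample above)
df = pd.DataFrame({
    'id':       [10, 10, 10, 220, 220, 220],
    'DateTime': ['01.01', '02.01', '03.01', '05.03', '06.03', '07.03'],
    'Value':    [78, 781, 45, 47, 8, 12],
})
# one column per id, indexed by DateTime; dates an id never saw become NaN
wide = df.pivot(index='DateTime', columns='id', values='Value')
print(wide)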
I have the following csv file that I converted to a DataFrame:
apartment,floor,gasbill,internetbill,powerbill
401,4,120,nan,340
409,4,190,50,140
410,4,155,45,180
I want to be able to iterate over the column, and if the value of a cell in the internetbill column is not a number, delete that whole row. So in this example, the "401,4,120,nan,340" row would be eliminated from the DataFrame.
I thought that something like this would work, but to no avail, and I'm stuck:
df.drop[df['internetbill'] == "nan"]
If you are using pd.read_csv, then that nan will be imported as np.nan. If so, you need dropna:
df.dropna(subset=['internetbill'])
apartment floor gasbill internetbill powerbill
1 409 4 190 50.0 140
2 410 4 155 45.0 180
If those are strings for whatever reason, you could do one of two things:
replace
df.replace({'internetbill': {'nan': np.nan}}).dropna(subset=['internetbill'])
to_numeric
df.assign(
internetbill=pd.to_numeric(df['internetbill'], errors='coerce')
).dropna(subset=['internetbill'])
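For completeness, a minimal sketch that runs the to_numeric route end to end on the sample file from the question (the CSV text is embedded as a string to keep it self-contained):
import pandas as pd
from io import StringIO
csv_text = 'apartment,floor,gasbill,internetbill,powerbill\n401,4,120,nan,340\n409,4,190,50,140\n410,4,155,45,180\n'
df = pd.read_csv(StringIO(csv_text))
# coerce anything non-numeric in internetbill to NaN, then drop those rows
cleaned = df.assign(
    internetbill=pd.to_numeric(df['internetbill'], errors='coerce')
).dropna(subset=['internetbill'])
print(cleaned)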
I have three dataframes with row counts of more than 71K. Below are the samples.
df_1 = pd.DataFrame({'Device_ID':[1001,1034,1223,1001],'Col_A':[45,56,78,33]})
df_2 = pd.DataFrame({'Device_ID':[1001,1034,1223,1001,1887],'Col_B':[35,46,78,33,66]})
df_3 = pd.DataFrame({'Device_ID':[1001,1034,1223,1001,1887,1223],'Col_C':[5,14,8,13,16,8]})
Edit
As suggested, below is my desired output
df_final
Device_ID Col_A Col_B Col_C
1001 45 35 5
1034 56 46 14
1223 78 78 8
1001 33 33 13
1887 NaN 66 16
1223 NaN NaN 8
When using pd.merge() or df_1.set_index('Device_ID').join([df_2.set_index('Device_ID'), df_3.set_index('Device_ID')], on='Device_ID'), it takes a very long time. One reason is the repeated values of Device_ID.
I am aware of the reduce method, but I suspect it may lead to the same situation.
Is there any better and efficient way?
To get your desired outcome, you can use this:
result = pd.concat([df_1.drop('Device_ID', axis=1), df_2.drop('Device_ID', axis=1), df_3], axis=1).set_index('Device_ID')
If you don't want to use Device_ID as index, you can remove the set_index part of the code. Also, note that because of the presence of NaN's in some columns (Col_A and Col_B) in the final dataframe, Pandas will cast non-missing values to floats, as NaN can't be stored in an integer array (unless you have Pandas version 0.24, in which case you can read more about it here).
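Running it against the sample frames from the question reproduces the desired output; a minimal sketch (positional alignment works here because the shorter frames share their row order with df_3):
import pandas as pd
df_1 = pd.DataFrame({'Device_ID': [1001, 1034, 1223, 1001], 'Col_A': [45, 56, 78, 33]})
df_2 = pd.DataFrame({'Device_ID': [1001, 1034, 1223, 1001, 1887], 'Col_B': [35, 46, 78, 33, 66]})
df_3 = pd.DataFrame({'Device_ID': [1001, 1034, 1223, 1001, 1887, 1223], 'Col_C': [5, 14, 8, 13, 16, 8]})
# align on the default RangeIndex; Device_ID is kept from df_3, the longest frame
result = pd.concat(
    [df_1.drop('Device_ID', axis=1), df_2.drop('Device_ID', axis=1), df_3],
    axis=1,
).set_index('Device_ID')
print(result)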
I've updated the information below to be a little clearer, as per the comments:
I have the following dataframe df (it has 38 columns this is only the last few):
Col # 33 34 35 36 37 38
id 09.2018 10.2018 11.2018 12.2018 LTx LTx2
123 0.505 0.505 0.505 0.505 33 35
223 2.462 2.464 0.0 30.0 33 36
323 1.231 1.231 1.231 1.231 33 35
423 0.859 0.855 0.850 0.847 33 36
I am trying to create a new column which is the sum of a slice using iloc, so, as an example, for id 123 it would look like the following:
df['LTx3'] = (df.iloc[:, 33:35]).sum(axis=1)
This is perfect obviously for 123 but not for 223. I had assumed this would work:
df['LTx3'] = (df.iloc[:, 'LTx':'LTx2']).sum(axis=1)
But I consistently get the same error:
TypeError: cannot do slice indexing on <class 'pandas.core.indexes.base.Index'> with these indexers [LTx] of <class 'str'>
I have been trying some variations of this, such as the one below, but unfortunately they also haven't led to a working solution:
df['LTx3'] = (df.iloc[:, df.columns.get_loc('LTx'):df.columns.get_loc('LTx2')]).sum(axis=1)
Basically, columns LTx and LTx2 are made up of integers but vary row to row. I want to use these integers as the references for the slice; I'm not quite certain how I should do this.
If anyone could help lead me to a solution it would be fantastic!
Cheers
I'd recommend reading up on .loc, .iloc slicing in pandas:
https://pandas.pydata.org/pandas-docs/stable/indexing.html
.loc selects based on name(s). .iloc selects based on index (numerical) position.
You can also subset based on column names. Note also that depending on how you create your dataframe, you may have numbers cast as strings.
To get the row corresponding to 223:
df3 = df[df['Col'] == '223']
To get the columns corresponding to the names 33, 34, and 35:
df3 = df[df['Col'] == '223'].loc[:, '33':'35']
If you want to select rows wherein any column contains a given string, I found this solution: Most concise way to select rows where any column contains a string in Pandas dataframe?
df[df.apply(lambda row: row.astype(str).str.contains('LTx2').any(), axis=1)]
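To address the original goal directly, with the per-row integers in LTx and LTx2 acting as slice bounds, one option is to compute the sum row by row. A minimal sketch, assuming LTx and LTx2 hold positional column indices and that the slice is end-exclusive, matching the df.iloc[:, 33:35] example above:
# sum a different positional slice for each row, driven by that row's LTx/LTx2
df['LTx3'] = [
    df.iloc[i, int(start):int(stop)].sum()
    for i, (start, stop) in enumerate(zip(df['LTx'], df['LTx2']))
]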
I am trying to do some manipulations on rows and columns at the same time, including date and time series, in Pandas. Traditionally, without series, Python dictionaries are great, but with Pandas this is a new thing for me.
Input files: N of them.
File1.csv, File2.csv, File3.csv, ........... Filen.csv
Ids,Date-time-1 Ids,Date-time-2 Ids,Date-time-1
56,4568 645,5545 25,54165
45,464 458,546
I am trying to merge the Date-time columns of all the files into one big data file with respect to Ids:
Ids,Date-time-ref,Date-time-1,date-time-2
56,100,4468,NAN
45,150,314,NAN
645,50,NAN,5495
458,200,NAN,346
25,250,53915,NAN
Check for the date-time column; if there is no match, create one, and then fill the values with respect to Ids by subtracting the date-time-ref value of that respective Id from the current date-time value.
Fill empty places with NaN, and if a later file has a value for that spot, replace the NaN with the new value.
If it were a straight column subtraction it would be pretty easy, but doing it in sync with the date-time series and with respect to Ids seems a bit confusing.
I'd appreciate some suggestions to begin with. Thanks in advance.
Here is one way to do it.
import pandas as pd
import numpy as np
from io import StringIO
# your csv file contents
csv_file1 = 'Ids,Date-time-1\n56,4568\n45,464\n'
csv_file2 = 'Ids,Date-time-2\n645,5545\n458,546\n'
# add a duplicated Ids record for testing purpose
csv_file3 = 'Ids,Date-time-1\n25,54165\n645, 4354\n'
csv_file_all = [csv_file1, csv_file2, csv_file3]
# read csv into df using list comprehension
# I use buffer here, replace stringIO with your file path
df_all = [pd.read_csv(StringIO(csv_file)) for csv_file in csv_file_all]
# processing
# =====================================================
# concat along axis=0, outer join on axis=1
merged = pd.concat(df_all, axis=0, ignore_index=True, join='outer').set_index('Ids')
Out[206]:
Date-time-1 Date-time-2
Ids
56 4568 NaN
45 464 NaN
645 NaN 5545
458 NaN 546
25 54165 NaN
645 4354 NaN
# custom function to handle/merge duplicates on Ids (axis=0)
def apply_func(group):
    return group.fillna(method='ffill').iloc[-1]
# remove Ids duplicates
merged_unique = merged.groupby(level='Ids').apply(apply_func)
Out[207]:
Date-time-1 Date-time-2
Ids
25 54165 NaN
45 464 NaN
56 4568 NaN
458 NaN 546
645 4354 5545
# do the subtraction
master_csv_file = 'Ids,Date-time-ref\n56,100\n45,150\n645,50\n458,200\n25,250\n'
df_master = pd.read_csv(StringIO(master_csv_file), index_col=['Ids']).sort_index()
# select matching records and horizontal concat
df_matched = pd.concat([df_master,merged_unique.reindex(df_master.index)], axis=1)
# use broadcasting
df_matched.iloc[:, 1:] = df_matched.iloc[:, 1:].sub(df_matched.iloc[:, 0], axis=0)
Out[208]:
Date-time-ref Date-time-1 Date-time-2
Ids
25 250 53915 NaN
45 150 314 NaN
56 100 4468 NaN
458 200 NaN 346
645 50 4304 5495
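To read the N files from disk instead of the in-memory buffers used above, the same list comprehension can be pointed at the actual paths. A minimal sketch, assuming the files match the pattern File*.csv in the working directory (the pattern is an assumption; adjust it to your file names):
import glob
import pandas as pd
# read every matching csv into a list of dataframes; the glob pattern is an assumption
df_all = [pd.read_csv(path) for path in sorted(glob.glob('File*.csv'))]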