I am trying to do some row and column manipulations at the same time, including date/time series, in Pandas. Without series, plain Python dictionaries work great, but Pandas is new territory for me.
Input files: N of them.
File1.csv, File2.csv, File3.csv, ..., Filen.csv
File1.csv         File2.csv         File3.csv
Ids,Date-time-1   Ids,Date-time-2   Ids,Date-time-1
56,4568           645,5545          25,54165
45,464            458,546
I am trying to merge the Date-time columns of all the files into one big data file, keyed by Ids:
Ids,Date-time-ref,Date-time-1,date-time-2
56,100,4468,NAN
45,150,314,NAN
645,50,NAN,5495
458,200,NAN,346
25,250,53915,NAN
Check for the date-time column: if it does not already exist in the output, create it, and then fill in the values for each Ids by subtracting the Date-time-ref value of that Ids from the current date-time value.
Fill empty places with NAN, and if a later file has a value for that Ids, update the cell with the new value instead of NAN.
If it were a straight column subtraction it would be pretty easy, but doing it in sync with the date-time series and keyed by Ids seems a bit confusing.
I'd appreciate some suggestions to get started. Thanks in advance.
Here is one way to do it.
import pandas as pd
import numpy as np
from io import StringIO
# your csv file contents
csv_file1 = 'Ids,Date-time-1\n56,4568\n45,464\n'
csv_file2 = 'Ids,Date-time-2\n645,5545\n458,546\n'
# add a duplicated Ids record for testing purposes
csv_file3 = 'Ids,Date-time-1\n25,54165\n645, 4354\n'
csv_file_all = [csv_file1, csv_file2, csv_file3]
# read each csv into a df using a list comprehension
# I use in-memory buffers here; replace StringIO with your file path
df_all = [pd.read_csv(StringIO(csv_file)) for csv_file in csv_file_all]
# processing
# =====================================================
# concat along axis=0, outer join on axis=1
merged = pd.concat(df_all, axis=0, ignore_index=True, join='outer').set_index('Ids')
Out[206]:
Date-time-1 Date-time-2
Ids
56 4568 NaN
45 464 NaN
645 NaN 5545
458 NaN 546
25 54165 NaN
645 4354 NaN
# custom function to handle/merge duplicates on Ids (axis=0)
def apply_func(group):
    # forward-fill within the duplicate group and keep the last (most complete) row
    return group.fillna(method='ffill').iloc[-1]
# remove Ids duplicates
merged_unique = merged.groupby(level='Ids').apply(apply_func)
Out[207]:
Date-time-1 Date-time-2
Ids
25 54165 NaN
45 464 NaN
56 4568 NaN
458 NaN 546
645 4354 5545
# reference values for the subtraction
master_csv_file = 'Ids,Date-time-ref\n56,100\n45,150\n645,50\n458,200\n25,250\n'
df_master = pd.read_csv(StringIO(master_csv_file), index_col=['Ids']).sort_index()
# select matching records and horizontal concat
df_matched = pd.concat([df_master,merged_unique.reindex(df_master.index)], axis=1)
# use broadcasting
df_matched.iloc[:, 1:] = df_matched.iloc[:, 1:].sub(df_matched.iloc[:, 0], axis=0)
Out[208]:
Date-time-ref Date-time-1 Date-time-2
Ids
25 250 53915 NaN
45 150 314 NaN
56 100 4468 NaN
458 200 NaN 346
645 50 4304 5495
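For real files on disk rather than the in-memory buffers above, the same list comprehension works with a glob pattern. A minimal sketch, assuming the File*.csv naming from the question:
import glob
import pandas as pd
# collect File1.csv ... Filen.csv (naming pattern assumed from the question)
paths = sorted(glob.glob('File*.csv'))
df_all = [pd.read_csv(path) for path in paths]
# then continue with the concat / groupby / subtraction steps shown above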
Most of the other questions about updating values in a pandas df focus on appending a new column or simply replacing a cell with a new value. My question is a bit different: my df already has values in it, and when I find a new value I need to add it to the cell to update its value. For example, if a cell already holds 5 and I find the value 10 in my file for that column/row, the cell should now be 15.
But I am having trouble writing this bit of code, and I can't even get values to show up in my dataframe.
I have a dictionary, for example:
id_dict={'Treponema': ['162'], 'Leptospira': ['174'], 'Azospirillum': ['192'], 'Campylobacter': ['195', '197', '199', '201'], 'Pseudomonas': ['287'], 'NONE': ['2829358', '2806529']}
And I have sample id files that contain ids and the number of times those ids showed up in a previous file, where the first value is the count and the second value is the id.
cat Sample1_idsummary.txt
1,162
15,174
4,195
5,197
6,201
10,2829358
Some of the ids map to the same key in id_dict, and I need to create a dataframe like the following:
Sample Treponema Leptospira Azospirillum Campylobacter Pseudomonas NONE
0 sample1 1 15 0 15 0 10
Here is my script, but my issue is that my output is always zero for all columns.
samplefile = sys.argv[1]
sample_ID = samplefile.split("_")[0]  # get just the ID name

def get_ids_counts(id_dict, samplefie):
    '''Obtain a table of id counts from the samplefile.'''
    column_names = ["Sample"]
    column_names.extend([x for x in list(id_dict.keys())])
    df = pd.DataFrame(columns=column_names)
    df["Sample"] = [sample_ID]
    with open(samplefile) as sf:  # open the sample taxid count file
        for line in sf:
            id = line.split(",")[1]  # the taxid (multiple can hit the same lineage info)
            idcount = int(line.split(",")[0])  # the count from uniq
            # For all keys in the dict, if that key is in the sample id file use the count from the id file
            # Otherwise all keys not found in the file are "0" in the df
            if id in id_dict:
                df[list(id_dict.keys())[list(id_dict.values().index(id))]] = idcount
    return df.fillna(0)
It's the very last if statement that is confusing me. How do I make idcount accumulate each time the same key comes up, and why do I always get zeros filled in?
The method mentioned below worked! Here is the updated code:
def get_ids_counts(id_dict, samplefie):
    '''Obtain a table of id counts from the samplefile.'''
    df = pd.DataFrame([id_dict]).stack().explode().to_frame('id').droplevel(0).reset_index().astype({'id': int})
    iddf = pd.read_csv(samplefile, sep=",", names=["count", "id"])
    df = df.merge(iddf, how='outer').fillna(0).groupby('index')['count'].sum().to_frame(sample_ID).T
    return df
And the output, which is still not coming up right:
index 0 Azospirillaceae Campylobacteraceae Leptospiraceae NONE Pseudomonadacea Treponemataceae
mini 106.0 0.0 20.0 0.0 0.0 0.0 5.0
UPDATE 2
With the code below and using my proper files I've managed to get the table, but I cannot for the life of me get the "NONE" column to show up anymore. Any suggestions? My output has every key with the proper counts, but "NONE" disappears.
Instead of doing it iteratively that way, you can automate it and let pandas perform those operations.
Start by creating the dataframe from id_dict:
df = pd.DataFrame([id_dict]).stack().explode().to_frame('id').droplevel(0).reset_index()\
.astype({'id': int})
index id
0 Treponema 162
1 Leptospira 174
2 Azospirillum 192
3 Campylobacter 195
4 Campylobacter 197
5 Campylobacter 199
6 Campylobacter 201
7 Pseudomonas 287
8 NONE 2829358
9 NONE 2806529
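As a side note, the same lookup table can be built a bit more directly from the dictionary with Series.explode; this is just an equivalent sketch, not something the rest of the answer depends on:
# equivalent construction of the id lookup table
df = (pd.Series(id_dict)
        .explode()
        .rename_axis('index')
        .reset_index(name='id')
        .astype({'id': int}))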
Read the count/id text file into a data frame:
idDF = pd.read_csv('Sample1_idsummary.txt', sep=',' , names=['count', 'id'])
count id
0 1 162
1 15 174
2 4 195
3 5 197
4 6 201
5 10 2829358
Now outer-merge both dataframes, fill NaNs with 0, group by index, call sum, build a dataframe by calling to_frame with the sample name as the column name, and finally transpose:
df.merge(idDF, how='outer').fillna(0).groupby('index')['count'].sum().to_frame('Sample1').T
OUTPUT:
index Azospirillum Campylobacter Leptospira NONE Pseudomonas Treponema
Sample1 0.0 15.0 15.0 10.0 0.0 1.0
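Since the question mentions several sample id files, the same merge can be repeated per file and the one-row results concatenated. A rough sketch, assuming the Sample*_idsummary.txt naming pattern from the question and the lookup df built above:
import glob
rows = []
for path in sorted(glob.glob('Sample*_idsummary.txt')):
    sample_id = path.split('_')[0]
    iddf = pd.read_csv(path, sep=',', names=['count', 'id'])
    rows.append(df.merge(iddf, how='outer').fillna(0)
                  .groupby('index')['count'].sum().to_frame(sample_id).T)
result = pd.concat(rows)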
I want to do 2 things:
I want to create one boxplot per date/day with all the values for MeanTravelTimeSeconds in that date. The number of MeanTravelTimeSeconds elements varies from date to date (e.g. one day might have a count of 300 values while another, 400).
Also, I want to transform the rows in my multiindex series into columns because I don't want the rows to repeat every time. If it stays like this I'd have tens of millions of unnecessary rows.
Here is the resulting series after using df.stack() on a df indexed by date (date is a datetime object index):
Date
2016-01-02 NumericIndex 1611664
OriginMovementID 4744
DestinationMovementID 5084
MeanTravelTimeSeconds 1233
RangeLowerBoundTravelTimeSeconds 756
...
2020-03-31 DestinationMovementID 3594
MeanTravelTimeSeconds 1778
RangeLowerBoundTravelTimeSeconds 1601
RangeUpperBoundTravelTimeSeconds 1973
DayOfWeek Tuesday
Length: 11281655, dtype: object
When I use seaborn to plot the boxplot I get a bunch of errors after playing with different selections.
If I try df.stack().unstack() or df.stack().T I get the following error:
Index contains duplicate entries, cannot reshape
How do I plot the boxplot and how do I turn the rows into columns?
You really do need to make your index unique for the functions you want to work. I suggest a sequential number that resets at every change in the other two key columns.
import datetime as dt
import random
import numpy as np
import pandas as pd
cat = ["NumericIndex","OriginMovementID","DestinationMovementID","MeanTravelTimeSeconds",
"RangeLowerBoundTravelTimeSeconds"]
df = pd.DataFrame(
[{"Date":d, "Observation":cat[random.randint(0,len(cat)-1)],
"Value":random.randint(1000,10000)}
for i in range(random.randint(5,20))
for d in pd.date_range(dt.datetime(2016,1,2), dt.datetime(2016,3,31), freq="14D")])
# starting point....
df = df.sort_values(["Date","Observation"]).set_index(["Date","Observation"])
# generate an array that is sequential within each change of key
seq = np.full(df.index.shape, 0)
s = 0
p = ""
for i, v in enumerate(df.index):
    if i == 0 or p != v: s = 0
    else: s += 1
    seq[i] = s
    p = v
df["SeqNo"] = seq
# add to index - now unstack works as required
dfdd = df.set_index(["SeqNo"], append=True)
dfdd.unstack(0).loc["MeanTravelTimeSeconds"].boxplot()
print(dfdd.unstack(1).head().to_string())
output
Value
Observation DestinationMovementID MeanTravelTimeSeconds NumericIndex OriginMovementID RangeLowerBoundTravelTimeSeconds
Date SeqNo
2016-01-02 0 NaN NaN 2560.0 5324.0 5085.0
1 NaN NaN 1066.0 7372.0 NaN
2016-01-16 0 NaN 6226.0 NaN 7832.0 NaN
1 NaN 1384.0 NaN 8839.0 NaN
2 NaN 7892.0 NaN NaN NaN
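As a design note, the per-key sequence number can also be generated without the explicit loop by counting each row's position within its (Date, Observation) group; a minimal equivalent sketch using the same df:
# same values as the loop above: position of each row within its (Date, Observation) group
df["SeqNo"] = df.groupby(level=["Date", "Observation"]).cumcount()
dfdd = df.set_index(["SeqNo"], append=True)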
For example, I have a dataframe below with multiple columns and rows in which the last column only has data for some of the rows. How can I take that last column and write it to a new dataframe while removing the empty cells that would remain if I just copied the entire column?
Part Number Count Miles
2345125 14 543
5432545 12
6543654 6 112
6754356 22
5643545 6
7657656 8 23
7654567 11 231
3455434 34 112
The data frame I want to obtain is below:
Miles
543
112
23
231
112
I've tried converting the empty cells to NaN and then removing them, but I always either get a key error or fail to remove the rows I want. Thanks for any help.
# copy the column
series = df['Miles']
# drop nan values
series = series.dropna()
# one-liner
series = df['Miles'].dropna()
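If you want the result shaped exactly like the desired output, a fresh dataframe holding only the Miles values with no gaps in the row labels, you can also reset the index; a small follow-up sketch:
# drop the NaN rows and discard the original row labels
miles_df = df['Miles'].dropna().reset_index(drop=True).to_frame()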
Do you mean:
df.loc[df.Miles.notna(), 'Miles']
Or if you want to drop the rows:
df = df[df.Miles.notna()]
I want to shift column values one space to the left. I don't want to save the original values of the column 'average_rating'.
I used the shift command:
data3 = data3.shift(-1, axis=1)
But the output I get has missing values for two columns: 'num_pages' and 'text_reviews_count'.
That happens because the data types of the source and target columns do not match. Try converting each affected column back to its target data type after the shift(), for example with .fillna(0).astype(int).
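For example, a rough sketch of that per-column conversion; the column names are taken from the question's output and this is only an illustration, not a drop-in fix:
shifted = data3.shift(-1, axis=1)
# restore the intended integer dtype on the columns that came back with missing values
for col in ['num_pages', 'text_reviews_count']:
    shifted[col] = shifted[col].fillna(0).astype(int)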
Alternatively, you can convert all the data in the dataframe to strings and then perform the shift. You may want to convert the columns back to their original data types afterwards.
from io import StringIO

df = df.astype(str)  # convert all data to str
df_shifted = df.shift(-1, axis=1)  # perform the shift
df_string = df_shifted.to_csv()  # dump the shifted frame to a CSV string
new_df = pd.read_csv(StringIO(df_string), index_col=0)  # read the data back so dtypes are re-inferred
Output:
average_rating isbn isbn13 language_code num_pages ratings_count text_reviews_count extra
0 3.57 0674842111 978067 en-US 236 55 6.0 NaN
1 3.60 1593600119 978067 eng 400 25 4.0 NaN
2 3.63 156384155X 978067 eng 342 38 4.0 NaN
3 3.98 1857237250 978067 eng 383 2197 17.0 NaN
4 0.00 0851742718 978067 eng 49 0 0.0 NaN
I have the following csv file that I converted to a DataFrame:
apartment,floor,gasbill,internetbill,powerbill
401,4,120,nan,340
409,4,190,50,140
410,4,155,45,180
I want to be able to iterate over each column, and if the value of a cell in the internetbill column is not a number, delete that whole row. So in this example, the "401,4,120,nan,340" row would be eliminated from the DataFrame.
I thought that something like this would work, but to no avail, and I'm stuck:
df.drop[df['internetbill'] == "nan"]
If you are using pd.read_csv, then that nan will be imported as np.nan. In that case you need dropna:
df.dropna(subset=['internetbill'])
apartment floor gasbill internetbill powerbill
1 409 4 190 50.0 140
2 410 4 155 45.0 180
If those are strings for whatever reason, you could do one of two things:
replace
df.replace({'internetbill': {'nan': np.nan}}).dropna(subset=['internetbill'])
to_numeric
df.assign(
internetbill=pd.to_numeric(df['internetbill'], errors='coerce')
).dropna(subset=['internetbill'])
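Putting it together with the CSV from the question, a minimal end-to-end sketch (the file name bills.csv is made up here):
import pandas as pd

df = pd.read_csv('bills.csv')            # 'nan' in the file is parsed as np.nan
df = df.dropna(subset=['internetbill'])  # drop rows with no internetbill value
print(df)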