Pandas: how to add a dataframe inside a cell of another dataframe?

I have an empty dataframe like the following:
simReal2013 = pd.DataFrame(index = np.arange(0,1,1))
Then I read in some .csv files as dataframes.
stat = np.arange(0,5)
xv = [0.005, 0.01, 0.05]
br = [0.001,0.005]
for i in xv:
    for j in br:
        I = 0
        for s in stat:
            string = 'results/2013/real/run_%d_%f_%f_15.0_10.0_T0_RealNet.csv' % (s, i, j)
            sim = pd.read_csv(string, sep=' ')
            I += np.array(sim.I)
        sim.I = I / 5
        col = '%f_%f' % (i, j)
        simReal2013.insert(0, col, sim)
I would like to add the dataframe that I read to a cell of simReal2013. In doing so, I get the following error:
ValueError: Wrong number of items passed 9, placement implies 1

Yes, putting a dataframe inside of a dataframe is probably not the way you want to go, but if you must, this is one way to do it:
df_in = pd.DataFrame([[1, 2, 3]] * 2)
d = {}
d['a'] = df_in
df_out = pd.DataFrame([d])
type(df_out.loc[0, "a"])
>>> pandas.core.frame.DataFrame
Maybe a dictionary of dataframes would suffice for your use.
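For instance, a dict keyed by the parameter combination could replace the insert call entirely. This is only a sketch following the question's own loop and file paths (xv, br, stat, and the five-run average of I all come from the question):
import numpy as np
import pandas as pd

stat = np.arange(0, 5)
xv = [0.005, 0.01, 0.05]
br = [0.001, 0.005]

sims = {}  # parameter combination -> averaged DataFrame
for i in xv:
    for j in br:
        runs = [pd.read_csv('results/2013/real/run_%d_%f_%f_15.0_10.0_T0_RealNet.csv' % (s, i, j),
                            sep=' ')
                for s in stat]
        sim = runs[0].copy()
        sim.I = np.mean([r.I for r in runs], axis=0)  # average I over the five runs
        sims['%f_%f' % (i, j)] = sim

# later, fetch the averaged frame for one combination:
# sims['%f_%f' % (0.005, 0.001)]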

Related

Concatenating lists and removing duplicates

I have three spreadsheets that all have a month/year column. There is some overlap between the spreadsheets (e.g. one covers 1998 to 2015 and another covers 2012 to 2020). I want a combined list of all the months/years with no duplicates. I have achieved this, but I feel there must be a cleaner way to do it.
The dataframes are all fairly similar:
Month    VALUE
1998M01  1
1998M02  2
import pandas as pd

unemp8315 = pd.read_csv('Unemployment 19832015.csv')
unemp9821 = pd.read_csv('Unemployment 19982021.csv')
unempcovid = pd.read_csv('Unemployment Covid.csv')
print(unemp8315)
print(unemp9821)
print(unempcovid)

monthlist = []
for i in unemp8315['Month']:
    monthlist.append(i)
monthlist2 = []
for b in unemp9821['Month']:
    monthlist2.append(b)
monthlist3 = []
for c in unempcovid['Month']:
    monthlist3.append(c)

full_month_list = monthlist + monthlist2 + monthlist3
fullpd = pd.DataFrame(data=full_month_list)
clean_month_list = fullpd.drop_duplicates()
print(clean_month_list)
There's no need to iterate over every single entry: you can simply concatenate the dataframes, select the Month column, and drop the duplicates there.
fullpd = pd.concat([unemp8315, unemp9821, unempcovid], axis=0)
clean_month_list = fullpd['Month'].drop_duplicates()
You can do something like this:
files = ['Unemployment 19832015.csv',
         'Unemployment 19982021.csv',
         'Unemployment Covid.csv']

dfs = [pd.read_csv(file)["Month"] for file in files]
clean_month_list = pd.concat(dfs).drop_duplicates()
Could you load them into a dictionary instead of a list, i.e. dict[month] = value?
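If the values matter as well as the months, that comment's idea might look like the sketch below; the VALUE column name is taken from the question's table, and letting later files overwrite earlier ones on overlapping months is an assumption:
# one dict keyed by month; later files overwrite earlier ones on overlap (assumption)
month_values = {}
for df in (unemp8315, unemp9821, unempcovid):
    for month, value in zip(df['Month'], df['VALUE']):
        month_values[month] = value

months = list(month_values)  # unique months, in first-seen order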

In Python, how can I use a loop to name pandas data frames?

What I'm trying to do is to use pandas to create as many separate data arrays as there are runs in my data set. The approach needs to vary depending on the data file read in, so I want the run number (the second column) to be used to identify the data and split it into separate data sets.
So I have a data set that looks like:
1.350000035018e-03 1.000000000000e+00 -1.617387196395e-14
2.850000048056e-03 1.000000000000e+00 -2.752685546875e-06
4.350000061095e-03 1.000000000000e+00 -2.062988281250e-06
(couple hundred lines later)
1.350000035018e-03 2.000000000000e+00 -1.617387196395e-14
2.850000048056e-03 2.000000000000e+00 -2.752685546875e-06
4.350000061095e-03 2.000000000000e+00 -2.062988281250e-06
(however many readings later)
1.350000035018e-03 35.000000000000e+00 -1.617387196395e-14
2.850000048056e-03 35.000000000000e+00 -2.752685546875e-06
4.350000061095e-03 35.000000000000e+00 -2.062988281250e-06
I want to process it into:
data1 = some number 1.0 some number
some number 1.0 some number
data2 = some number 2.0 some number
some number 2.0 some number
datan= some number n some number
some number n some number
So far my code:
f = r'C:~.dat'
#store data using pandas
data = pd.read_csv(f, sep='\t', comment='#', names=['V', 'n', 'I'])
#observe data format
print(data)

          V    n             I
0  0.001350  1.0 -1.617387e-14
1  0.002850  1.0 -2.752686e-06
2  0.004350  1.0 -2.062988e-06

#count the loops for automated graph plotting
num = 1
for i in range(len(data)):
    if i > 0:
        if data['n'][i] > data['n'][i-1]:
            num = num + 1
#
print('there are ' + str(num) + ' runs')

#separate data based on loop #n
for i in range(num):
    run = data.groupby(data.n)
    data+str(i) = run.get_group(i)  # <- not valid Python: can't assign to an expression
    print(data+str(i))
#
Using the data-grouping method works, but I can't figure out a way to use the loop number as a name variable. Any help/suggestions would be highly appreciated.
Do you need to explicitly name your dataframes or can it be part of a list or dict?
For instance, you could do something like this...
import pandas as pd

f = r'C:~.dat'
#store data using pandas
data = pd.read_csv(f, sep='\t', comment='#', names=['V', 'n', 'I'])

data_list = []
# get unique run entries
runs = data["n"].unique()
# save each run's corresponding dataframe into data_list
for run in runs:
    data_sub = data[data["n"] == run]
    data_list.append(data_sub)
# access it by doing something as follows
for idx, run in enumerate(runs):
    print("Working on run {}".format(run))
    df_to_operate_on = data_list[idx]
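If a dict fits better than a positional list, groupby can build one directly; a minimal sketch, equivalent to the loop above:
# dict mapping each run number to its sub-DataFrame
data_by_run = {run: group for run, group in data.groupby("n")}

# the run column is float in this data, so the keys are floats:
first_run = data_by_run[1.0]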
I'm not entirely sure I understand correctly what you're trying to achieve. But if you aim to have data like this:
1.350000035018e-03 1 -1.617387196395e-14
2.850000048056e-03 2 -2.752685546875e-06
4.350000061095e-03 3 -2.062988281250e-06
do you really need the n column?
Isn't that just the data.index + 1?
(the index in your example is [0, 1, 2], and you're looking for [1, 2, 3], so you might be able to do something like data.n = [i + 1 for i in data.index])

convert pandas series (with strings) to python list

It's probably a silly thing, but I can't seem to correctly convert a pandas series, originally obtained from an Excel sheet, to a list.
dfCI is created by importing data from an excel sheet and looks like this:
tab      var          val
MsrData  sortfield    DetailID
MsrData  strow        4
MsrData  inputneeded  "MeasDescriptionTest", "SiteLocTest", "SavingsCalcsProvided", "BiMonthlyTest"
# get list of cols for which input is needed
cols = dfCI[((dfCI['var'] == 'inputneeded') & (dfCI['tab'] == 'MsrData'))]['val'].values.tolist()
print(cols)
>> ['"MeasDescriptionTest", "SiteLocTest", "SavingsCalcsProvided", "BiMonthlyTest"']
# replace null text with text
invalid = 'Input Needed'
for col in cols:
    dfMSR[col] = np.where((dfMSR[col].isnull()), invalid, dfMSR[col])
However, the extra set of (single) quotes added when I converted cols from a series to a list makes all the column names a single value, so that
col = '"MeasDescriptionTest", "SiteLocTest", "SavingsCalcsProvided", "BiMonthlyTest"'
The desired output for cols is
cols = ["MeasDescriptionTest", "SiteLocTest", "SavingsCalcsProvided", "BiMonthlyTest"]
What am I doing wrong?
Once you've got col, you can convert it to your expected output:
In [1109]: col = '"MeasDescriptionTest", "SiteLocTest", "SavingsCalcsProvided", "BiMonthlyTest"'
In [1114]: cols = [i.strip() for i in col.replace('"', '').split(',')]
In [1115]: cols
Out[1115]: ['MeasDescriptionTest', 'SiteLocTest', 'SavingsCalcsProvided', 'BiMonthlyTest']
Another possible solution that comes to mind given the structure of cols is:
list(eval(cols[0])) # ['MeasDescriptionTest', 'SiteLocTest', 'SavingsCalcsProvided', 'BiMonthlyTest']
Although this is valid, it's less safe and I would go with list-comprehension as #MayankPorwal suggested.
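If you do want to evaluate the string, ast.literal_eval is a safer middle ground, since it only accepts Python literals rather than arbitrary expressions; a small sketch:
import ast

col = '"MeasDescriptionTest", "SiteLocTest", "SavingsCalcsProvided", "BiMonthlyTest"'
cols = list(ast.literal_eval(col))  # parses the quoted names as a tuple of strings
# ['MeasDescriptionTest', 'SiteLocTest', 'SavingsCalcsProvided', 'BiMonthlyTest']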

Python remove everything after specific string and loop through all rows in multiple columns in a dataframe

I have a file full of URL paths like the one below, spanning four columns in a dataframe that I am trying to clean:
Path1 = ["https://contentspace.global.xxx.com/teams/Australia/WA/Documents/Forms/AllItems.aspx?\
RootFolder=%2Fteams%2FAustralia%2FWA%2FDocuments%2FIn%20Scope&FolderCTID\
=0x012000EDE8B08D50FC3741A5206CD23377AB75&View=%7B287FFF9E%2DD60C%2D4401%2D9ECD%2DC402524F1D4A%7D"]
I want to remove everything after a specific string, which I defined as "string1", and I would like to loop through all 4 columns in the dataframe, defined as "df_MasterData":
string1 = "&FolderCTID"
import pandas as pd
df_MasterData = pd.read_excel(FN_MasterData)
cols = ['Column_A', 'Column_B', 'Column_C', 'Column_D']
for i in cols:
    # Objective: find "&FolderCTID" and delete everything after it
    string1 = "&FolderCTID"
    # Method 1
    df_MasterData[i] = df_MasterData[i].str.split(string1).str[0]
    # Method 2
    df_MasterData[i] = df_MasterData[i].str.split(string1).str[1].str.strip()
    # Method 3
    df_MasterData[i] = df_MasterData[i].str.split(string1)[:-1]
I searched and googled and found similar solutions, but none of them work.
Can any guru shed some light on this? Any assistance is appreciated.
Added below are a few example rows from columns A and B for these URLs:
Column_A = ['https://contentspace.global.xxx.com/teams/Australia/NSW/Documents/Forms/AllItems.aspx?\
RootFolder=%2Fteams%2FAustralia%2FNSW%2FDocuments%2FIn%20Scope%2FA%20I%20TOPPER%20GROUP&FolderCTID=\
0x01200016BC4CE0C21A6645950C100F37A60ABD&View=%7B64F44840%2D04FE%2D4341%2D9FAC%2D902BB54E7F10%7D',\
'https://contentspace.global.xxx.com/teams/Australia/Victoria/Documents/Forms/AllItems.aspx?RootFolder\
=%2Fteams%2FAustralia%2FVictoria%2FDocuments%2FIn%20Scope&FolderCTID=0x0120006984C27BA03D394D9E2E95FB\
893593F9&View=%7B3276A351%2D18C1%2D4D32%2DADFF%2D54158B504FCC%7D']
Column_B = ['https://contentspace.global.xxx.com/teams/Australia/WA/Documents/Forms/AllItems.aspx?\
RootFolder=%2Fteams%2FAustralia%2FWA%2FDocuments%2FIn%20Scope&FolderCTID=0x012000EDE8B08D50FC3741A5\
206CD23377AB75&View=%7B287FFF9E%2DD60C%2D4401%2D9ECD%2DC402524F1D4A%7D',\
'https://contentspace.global.xxx.com/teams/Australia/QLD/Documents/Forms/AllItems.aspx?RootFolder=%\
2Fteams%2FAustralia%2FQLD%2FDocuments%2FIn%20Scope%2FAACO%20GROUP&FolderCTID=0x012000E689A6C1960E8\
648A90E6EC3BD899B1A&View=%7B6176AC45%2DC34C%2D4F7C%2D9027%2DDAEAD1391BFC%7D']
This is how I would do it:
first, declare a variable with your target columns;
then use stack() and str.split to get your target output
(column 0 of the expanded split is the part before string1, which is the part you want to keep);
finally, unstack and reapply the output to your original df.
cols_to_slice = ['ColumnA', 'ColumnB', 'ColumnC', 'ColumnD']
string1 = "&FolderCTID"
df[cols_to_slice].stack().str.split(string1, expand=True)[0].unstack(1)
If you want to replace these columns in your target df, then simply do:
df[cols_to_slice] = df[cols_to_slice].stack().str.split(string1, expand=True)[0].unstack(1)
You should first get the index just past the string:
indexes = len(string1) + df_MasterData[i].str.find(string1)
# this points to the position right after string1 in each value;
# drop the len(string1) term if you don't want to keep string1 in the result
Now cut each value at its own position (note that .str[:indexes] does not broadcast a Series of stop positions, so pair them up explicitly):
df_MasterData[i] = [s[:idx] for s, idx in zip(df_MasterData[i], indexes)]
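For what it's worth, a vectorized regex replace sidesteps the index bookkeeping entirely; a sketch, assuming everything from &FolderCTID to the end of each value should go:
for i in cols:
    # drop string1 and everything after it; values without it are left unchanged
    df_MasterData[i] = df_MasterData[i].str.replace(r'&FolderCTID.*', '', regex=True)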

Python pandas: empty df but columns have elements

I have a really irritating problem in my script and have no idea what's wrong. When I try to filter my dataframe and then add rows to a new one, which I want to export to Excel, this happens:
The file exports as an empty DF, and print also shows me that "report" is empty, but when I print report.Name, report.Value etc. I get a normal, proper output with elements. Also, I can only export one column to Excel, not the entire DF, which looks empty... What can cause this strange accident?
So this is my script:
df = pd.read_excel('testfile2.xlsx')
report = pd.DataFrame(columns=['Type', 'Name', 'Value'])
for index, row in df.iterrows():
    if type(row[0]) == str:
        type_name = row[0].split(" ")
        if type_name[0] == 'const':
            selected_index = index
            report['Type'].loc[index] = type_name[1]
            report['Name'].loc[index] = type_name[2]
            report['Value'].loc[index] = row[1]
        else:
            for elements in type_name:
                report['Value'].loc[selected_index] += " " + elements
    elif type(row[0]) == float:
        df = df.drop(index=index)
print(report)       # output - Empty DataFrame
print(report.Name)  # output - over 500 elements
You are trying to manipulate a series that does not exist, which leads to the behaviour you describe.
Doing what you did, just with a much simpler example, I get the same result:
report = pd.DataFrame(columns=['Type','Name','Value'])
report['Type'].loc[0] = "A"
report['Name'].loc[0] = "B"
report['Value'].loc[0] = "C"
print(report) #empty df
print(report.Name) # prints "B" in a series
Easy solution: Just add the whole row instead of the three single values:
report = pd.DataFrame(columns=['Type','Name','Value'])
report.loc[0] = ["A", "B", "C"]
or in your code:
report.loc[index] = [type_name[1], type_name[2], row[1]]
If you want to do it the same way you are doing it at the moment, you first need to add an empty series with the given index to your DataFrame before you can manipulate it:
report.loc[index] = pd.Series([])
report['Type'].loc[index] = type_name[1]
report['Name'].loc[index] = type_name[2]
report['Value'].loc[index] = row[1]
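A common alternative that sidesteps indexed assignment altogether is to collect the rows in a plain list and build the DataFrame once at the end. This sketch mirrors the question's parsing logic (the float branch, which only drops rows from df, is omitted):
rows = []
for index, row in df.iterrows():
    if type(row[0]) == str:
        type_name = row[0].split(" ")
        if type_name[0] == 'const':
            rows.append({'Type': type_name[1], 'Name': type_name[2], 'Value': row[1]})
        elif rows:
            # continuation line: append the words to the last row's Value
            rows[-1]['Value'] = str(rows[-1]['Value']) + " " + " ".join(type_name)

report = pd.DataFrame(rows, columns=['Type', 'Name', 'Value'])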
