I am a newbie to Python. I am trying to iterate over the rows of individual columns of a dataframe. I am trying to create an adjacency list from the first two columns of the dataframe, taken from CSV data (which has 3 columns).
The following is the code to iterate over the dataframe and create a dictionary for the adjacency list:
df1 = pd.read_csv('person_knows_person_0_0_sample.csv', sep=',', index_col=False, skiprows=1)
src_list = list(df1.iloc[:, 0:1])
tgt_list = list(df1.iloc[:, 1:2])
adj_list = {}
for src in src_list:
    for tgt in tgt_list:
        adj_list[src] = tgt
print(src_list)
print(tgt_list)
print(adj_list)
and the following is the output I am getting:
['933']
['4139']
{'933': '4139'}
I see that I am not getting the entire list when I use the list() constructor.
Hence I am not able to loop over all of the data.
Could anyone tell me where I am going wrong?
To summarize, here is the input data:
A,B,C
933,4139,20100313073721718
933,6597069777240,20100920094243187
933,10995116284808,20110102064341955
933,32985348833579,20120907011130195
933,32985348838375,20120717080449463
1129,1242,20100202163844119
1129,2199023262543,20100331220757321
1129,6597069771886,20100724111548162
1129,6597069776731,20100804033836982
The output that I am expecting:
933: [4139,6597069777240, 10995116284808, 32985348833579, 32985348838375]
1129: [1242, 2199023262543, 6597069771886, 6597069776731]
Use groupby to create a Series of lists, and then to_dict:
# selecting columns by name
d = df1.groupby('A')['B'].apply(list).to_dict()
# selecting columns by position
d = df1.iloc[:, 1].groupby(df1.iloc[:, 0]).apply(list).to_dict()
print(d)
{933: [4139, 6597069777240, 10995116284808, 32985348833579, 32985348838375],
1129: [1242, 2199023262543, 6597069771886, 6597069776731]}
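For completeness, a minimal end-to-end sketch. The root problem in the question's code is that skiprows=1 promotes the first data row ('933,4139,...') to the header, and calling list() on a DataFrame yields its column labels, hence ['933'] and ['4139']. Reading the file with the default header instead (assuming the sample data above is saved as person_knows_person_0_0_sample.csv):

import pandas as pd

# the default header=0 uses the 'A,B,C' row as column names
df1 = pd.read_csv('person_knows_person_0_0_sample.csv')

# group column B's values by column A and collect them into lists
d = df1.groupby('A')['B'].apply(list).to_dict()
print(d)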
I have 4 different dataframes containing time series data that all have the same structure.
My goal is to take each individual dataframe and pass it through a function I have defined that will group them by datestamp, sum the columns and return a new dataframe with the columns I want. So in total I want 4 new dataframes that have only the data I want.
I just looked through this post:
Loop through different dataframes and perform actions using a function
but applying this did not change my results.
Here is my code:
I am putting the dataframes in a list so I can iterate through them:
dfs = [vds, vds2, vds3, vds4]
This is the function I want to pass each dataframe through:
def VDS_pre(df):
    df = df.groupby(['datestamp','timestamp']).sum().reset_index()
    df = df.rename(columns={'datestamp': 'Date', 'timestamp': 'Time', 'det_vol': 'VolumeVDS'})
    df = df[['Date','Time','VolumeVDS']]
    return df
This is the loop I made to iterate through my dataframe list and pass each one through my function:
for df in dfs:
    df = VDS_pre(df)
However once I go through my loop and go to print out the dataframes, they have not been modified and look like they initially did. Thanks for the help!
"However once I go through my loop and go to print out the dataframes, they have not been modified and look like they initially did."
Yes, this is actually the case. The reason they have not been modified is:
Assignment to the loop variable in a for item in lst: loop has no effect on either lst itself or the identifiers/variables from which the list items got their values, as the following code demonstrates:
v1 = 1; v2 = 2; v3 = 3
lst = [v1, v2, v3]
for item in lst:
    item = 0
print(lst, v1, v2, v3)  # gives: [1, 2, 3] 1 2 3
To achieve the result you expect, you can use a list comprehension together with Python's iterable unpacking:
vds, vds2, vds3, vds4 = [VDS_pre(df) for df in [vds, vds2, vds3, vds4]]
or the following code, which uses a list of strings holding the identifier/variable names of the dataframes:

sdfs = ['vds', 'vds2', 'vds3', 'vds4']
for sdf in sdfs:
    exec(f'{sdf} = VDS_pre(eval(sdf))')
Now printing vds, vds2, vds3 and vds4 will output the modified dataframes.
Pandas frame operations return a new copy of the data. Your snippet stores the result in the df variable, which is never written back to your initial list. This is why you don't have any stored result after execution.
If you don't need to keep the original frames, you may simply overwrite them:
for i, df in enumerate(dfs):
    dfs[i] = VDS_pre(df)
Otherwise, use a second list and append the results to it:
l = []
for df in dfs:
    df2 = VDS_pre(df)
    l.append(df2)
Or, even better, use a list comprehension to rewrite this snippet as a single line of code.
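For example:

l = [VDS_pre(df) for df in dfs]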
Now you are able to store the result of your processing.
Additionally, if your frames have the same structure and can be merged into a single frame, you may consider concatenating them first and then applying your function to the result. That would be more idiomatic pandas.
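A sketch of that approach (assuming the four frames can simply be stacked row-wise; note it yields one combined result rather than four separate ones):

# stack all rows, then group/sum once over the combined data
combined = pd.concat([vds, vds2, vds3, vds4], ignore_index=True)
result = VDS_pre(combined)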
I have a very large dictionary of dataframes. It contains around 250 dataframes, each of which has around 50 columns. My goal is to concat the dataframes to create one large df; however, as you can imagine, this process isn't great, because it will create a df that is way too large to view outside of Python.
My goal is to split the large dictionary of dfs in half and turn it into two large, but manageable, files.
I will try to replicate what it looks like:
d = {df1, df2, ..., df500}
df = pd.concat(d)
# However, is there a way to split it 50/50?
df1 = pd.concat(d)  # only gets the first 250 dfs
df2 = pd.concat(d)  # only gets the last 250 dfs
How about something like this?
v = list(d.values())
part1 = v[:len(v)//2]
part2 = v[len(part1):]
df1 = pd.concat(part1)
df2 = pd.concat(part2)
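If you also want to keep the dictionary keys, note that pd.concat accepts a mapping directly (the keys become an outer index level), so you can split the items instead; a sketch:

items = list(d.items())
half = len(items) // 2
df1 = pd.concat(dict(items[:half]))  # keys preserved as an outer index level
df2 = pd.concat(dict(items[half:]))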
First of all, it's not a dictionary, it's a set, which can be converted to a list.
A list can be divided in two as you need:
d = list(d)
ln = len(d)
d1 = d[0:ln//2]
d2 = d[ln//2:]
df1 = pd.concat(d1)
df2 = pd.concat(d2)
There is an Excel file that I have to read, do some processing on every pair of columns, and eventually concatenate the results vertically into a two-column dataframe.
I coded the process for columns 0 and 1.
Now I am struggling to write a function that does the same for every pair of columns.
I first selected the first two columns from the Excel file as below:
data1 = pd.read_excel('Book2.xlsx', usecols=[0,1], parse_dates=True)
How can I generate data1 to data5 from columns (0,1), (2,3), (4,5), (6,7) and do pd.concat([data1,data2,data3,data4,data5])?
I can't share the Excel file; however, it looks like:
stocks = [('2021-01-04', 113.4377, '2021-01-04', 'Nan'),
          ('2021-01-07', 125.8316, '2021-01-07', 127.8212),
          ('2021-01-14', 108.4792, '2021-01-14', 111.0318),
          ('2021-01-21', 99.584, '2021-01-21', 144.6129),
          ]
df = pd.DataFrame(stocks, columns=['DateA', 'StockA', 'DateB', 'StockB'])
df
You can read the entire dataset, loop over its columns, and save all the slices in a dictionary:
data = pd.read_excel('Book2.xlsx', parse_dates=True)
d = {}
for i in range(5):
    d['data_' + str(i+1)] = data[data.columns[i*2:i*2+2]]
And finally concat all the datasets:
result = pd.concat(d.values())
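One caveat: each pair keeps its own column names (DateA/StockA, DateB/StockB), so a plain vertical concat will not line the values up into two columns. A sketch that renames each slice first (assuming 'Date' and 'Stock' as the common target names):

data = pd.read_excel('Book2.xlsx', parse_dates=True)
pairs = []
for i in range(len(data.columns) // 2):
    pair = data.iloc[:, i*2:i*2+2].copy()
    pair.columns = ['Date', 'Stock']  # common names so the rows align
    pairs.append(pair)
result = pd.concat(pairs, ignore_index=True)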
I have a pandas data frame with only two column names (a single row, which can also be considered as headers). I want to make a dictionary out of this, with the first column being the key and the second column being the value. I already tried the
to_dict() method, but it's not working as it's an empty dataframe.
Example:
df = |Land|Norway| to {'Land': 'Norway'}
I can change the pandas data frame to some other type and find my way around it, but this question is mostly to learn the best/different/efficient approach for this problem.
For now I have this as the solution:
dict(zip(a.iloc[0:0, 0:1], a.iloc[0:0, 1:2]))
Is there any other way to do this?
Here's a simple way: convert the columns to a list, then the list to a dictionary.
def list_to_dict(a):
    it = iter(a)
    ret_dict = dict(zip(it, it))  # pair consecutive items as key/value
    return ret_dict

df = pd.DataFrame([], columns=['Land', 'Norway'])
dict_val = list_to_dict(df.columns.to_list())
dict_val  # {'Land': 'Norway'}
A very manual solution:
df = pd.DataFrame(columns=['Land', 'Norway'])
df = pd.DataFrame({df.columns[0]: df.columns[1]}, index=[0])
If you have any number of columns and you want each sequential pair to have this transformation, try:
df = pd.DataFrame(dict(zip(df.columns[::2], df.columns[1::2])), index=[0])
Note: the manual solution above will raise an IndexError if your DataFrame does not have at least two columns; the zip version simply yields an empty result in that case.
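For example (hypothetical column names):

df = pd.DataFrame(columns=['Land', 'Norway', 'Capital', 'Oslo'])
pd.DataFrame(dict(zip(df.columns[::2], df.columns[1::2])), index=[0])
#      Land Capital
# 0  Norway    Oslo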
For my project I'm reading in a CSV file with data from every state in the US. My function converts each of these into a separate dataframe, as I need to perform operations on each state's information.
def RanktoDF(csvFile):
    df = pd.read_csv(csvFile)
    df = df[pd.notnull(df['Index'])]  # drop all null values
    df = df[df.Index != 'Index']      # drop all extra headers
    df = df.set_index('State')        # set State as index
    return df
I apply this function to every one of my files and bind the resulting df to a name from my array varNames:
for name, s in zip(glob.glob('*.csv'), varNames):
    vars()["Crime" + s] = RanktoDF(name)
All of that works perfectly.
My problem is that I also want to create a dataframe that's made up of one column from each of those state dataframes.
I have tried iterating through a list of my dataframes and selecting the column (Population) that I want to append to a new dataframe:
dfList
dfNewIndex = pd.DataFrame(index=CrimeRank_1980_df.index)  # create new DF with index

for name in dfList:  # dfList is my list of dataframes
    newIndex = name['Population']
    dfNewIndex.append(newIndex)
    # dfNewIndex = pd.concat([dfNewIndex, dfList[name['Population']], axis=1)
My error is always the same; it tells me that name is treated as a string rather than an actual dataframe:
TypeError                                 Traceback (most recent call last)
<ipython-input-30-5aa85b0174df> in <module>()
      3
      4 for name in dfList:
----> 5     newIndex = name['Index']
      6     dfNewIndex.append(newIndex)
      7     # dfNewIndex = pd.concat([dfNewIndex, dfList[name['Population']], axis=1)

TypeError: string indices must be integers
I understand that my list is a list of strings rather than variables/dataframes, so my question is: how can I correct my code to do what I want, or is there an easier way of doing this?
Any solutions I've looked up have given answers where the dataframes are explicitly typed in order to be concatenated, but I have 50, so that's a little unfeasible. Any help would be appreciated.
One way would be to index into vars(), e.g.:
for name in dfList:
    newIndex = vars()[name]["Population"]
Alternatively, I think it would be neater to store your dataframes in a container and iterate through that, e.g.:
frames = {}
for name, s in zip(glob.glob('*.csv'), varNames):
    frames["Crime" + s] = RanktoDF(name)

for name in frames:
    newIndex = frames[name]["Population"]
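From there, a sketch of building the frame the question asks for, with one Population column per state (assuming the state indexes line up across files):

# one column per dataframe; the dict keys become the column names
populations = pd.concat(
    {name: frames[name]['Population'] for name in frames}, axis=1
)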