I am new to pandas, and I would appreciate any help. I have a pandas DataFrame that comes from a CSV file. The data contains two columns: dates and cashflows. Is it possible to convert these columns into a list of tuples? Here is how my dataset looks:
2021/07/15 4862.306832
2021/08/15 3474.465543
2021/09/15 7121.260118
The desired output is:
[(2021/07/15, 4862.306832),
(2021/08/15, 3474.465543),
(2021/09/15, 7121.260118)]
Use apply with a lambda function:
data = {
"date":["2021/07/15","2021/08/15","2021/09/15"],
"value":["4862.306832","3474.465543","7121.260118"]
}
df = pd.DataFrame(data)
listt = df.apply(lambda x: (x["date"], x["value"]), axis=1).tolist()
Output:
[('2021/07/15', '4862.306832'),
('2021/08/15', '3474.465543'),
('2021/09/15', '7121.260118')]
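As a side note, apply isn't required here: a plain zip (or itertuples) produces the same list of tuples and is usually faster. A minimal sketch, assuming the same df as above:
# same result without apply
pairs = list(zip(df["date"], df["value"]))
# or, equivalently, plain tuples straight from the frame
pairs = list(df.itertuples(index=False, name=None))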
I'm new to pandas and I want to know if there is a way to map a column of lists in a dataframe to values stored in a dictionary.
Let's say I have the DataFrame 'df' and the dictionary 'dic'. I want to create a new column named 'Description' in the DataFrame where I can see the description of the codes shown. The values of the items in the column should be stored in a list as well.
import pandas as pd
data = {'Codes':[['E0'],['E0','E1'],['E3']]}
df = pd.DataFrame(data)
dic = {'E0':'Error Code', 'E1':'Door Open', 'E2':'Door Closed'}
The most efficient approach would be a list comprehension:
df['Description'] = [[dic.get(x, None) for x in l] for l in df['Codes']]
Output:
Codes Description
0 [E0] [Error Code]
1 [E0, E1] [Error Code, Door Open]
2 [E3] [None]
If needed, you can post-process to replace the empty lists with NaN. An alternative list comprehension avoids non-matches altogether: [[dic[x] for x in l if x in dic] for l in df['Codes']], but the result would be ambiguous if a row has one non-match among several matches (you could no longer tell which description belongs to which code).
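For example, a minimal post-processing sketch (assuming the df and dic defined above) that skips non-matches and then replaces empty result lists with NaN:
import numpy as np
# keep only codes that have a description
df['Description'] = [[dic[x] for x in l if x in dic] for l in df['Codes']]
# rows where nothing matched become NaN instead of []
df['Description'] = df['Description'].apply(lambda l: l if l else np.nan)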
I have a pandas DataFrame with only two column names (a single row, which can also be considered as headers). I want to make a dictionary out of this, with the first column name as the key and the second as the value, as in the example below. I already tried the
to_dict() method, but it's not working as it's an empty DataFrame.
Example
df = |Land|Norway| -> {'Land': 'Norway'}
I can change the pandas data frame to some other type and find my way around it, but this question is mostly to learn the best/different/efficient approach for this problem.
For now I have this as the solution:
dict(zip(a.iloc[0:0, 0:1], a.iloc[0:0, 1:2]))
Is there any other way to do this?
Here's a simple way: convert the columns to a list, then the list to a dictionary.
def list_to_dict(a):
    # zip an iterator with itself to pair up consecutive items
    it = iter(a)
    ret_dict = dict(zip(it, it))
    return ret_dict
df = pd.DataFrame([], columns=['Land', 'Norway'])
dict_val = list_to_dict(df.columns.to_list())
dict_val  # {'Land': 'Norway'}
A very manual solution:
df = pd.DataFrame(columns=['Land', 'Norway'])
df = pd.DataFrame({df.columns[0]: df.columns[1]}, index=[0])
If you have any number of columns and you want each sequential pair to have this transformation, try:
df = pd.DataFrame(dict(zip(df.columns[::2], df.columns[1::2])), index=[0])
Note: You will get an error if your DataFrame does not have at least two columns.
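For just the original two-column case, a one-liner sketch in the same spirit:
df = pd.DataFrame(columns=['Land', 'Norway'])
# the first two column names form a single key/value pair
dict([df.columns[:2].tolist()])  # {'Land': 'Norway'}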
I have DataFrame names stored in a list, built like this:
target_dfs = []
for x in np.arange(1950, 2020):
    target_dfs.append('df_stats_data_' + str(x))
This yields a list of strings. But the actual DataFrames with those names do exist.
How do I effectively use each value on the list as a DataFrame and do operations such as dropping the last 3 rows?
I am trying to avoid doing something like this:
df_stats_data_1950 = df_stats_data_1950.iloc[:-3]
...
df_stats_data_2020 = df_stats_data_2020.iloc[:-3]
Make it a dictionary:
target_dfs = {1950: df1, 1951: df2}
You can now do stuff like:
for x in np.arange(1950, 2020):
    target_dfs[x] = target_dfs[x].iloc[:-3]
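A fuller sketch of this pattern, assuming (hypothetically) that each year's data comes from a file named stat_data_<year>.csv:
import pandas as pd
# build the dict of DataFrames up front instead of a list of names
target_dfs = {year: pd.read_csv(f'stat_data_{year}.csv')
              for year in range(1950, 2020)}
# drop the last 3 rows of every DataFrame in one pass
target_dfs = {year: df.iloc[:-3] for year, df in target_dfs.items()}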
I am a newbie to Python. I am trying to iterate over rows of individual columns of a DataFrame in Python. I am trying to create an adjacency list using the first two columns of the DataFrame, taken from CSV data (which has 3 columns).
The following is the code to iterate over the DataFrame and create a dictionary for the adjacency list:
df1 = pd.read_csv('person_knows_person_0_0_sample.csv', sep=',', index_col=False, skiprows=1)
src_list = list(df1.iloc[:, 0:1])
tgt_list = list(df1.iloc[:, 1:2])
adj_list = {}
for src in src_list:
    for tgt in tgt_list:
        adj_list[src] = tgt
print(src_list)
print(tgt_list)
print(adj_list)
and the following is the output I am getting:
['933']
['4139']
{'933': '4139'}
I see that I am not getting the entire list when I use the list() constructor, and hence I am not able to loop over the entire data.
Could anyone tell me where I am going wrong?
To summarize, here is the input data:
A,B,C
933,4139,20100313073721718
933,6597069777240,20100920094243187
933,10995116284808,20110102064341955
933,32985348833579,20120907011130195
933,32985348838375,20120717080449463
1129,1242,20100202163844119
1129,2199023262543,20100331220757321
1129,6597069771886,20100724111548162
1129,6597069776731,20100804033836982
The output that I am expecting:
933: [4139,6597069777240, 10995116284808, 32985348833579, 32985348838375]
1129: [1242, 2199023262543, 6597069771886, 6597069776731]
Use groupby to create a Series of lists, then call to_dict:
# selecting columns by name
d = df1.groupby('A')['B'].apply(list).to_dict()
# selecting columns by position
d = df1.iloc[:, 1].groupby(df1.iloc[:, 0]).apply(list).to_dict()
print(d)
{933: [4139, 6597069777240, 10995116284808, 32985348833579, 32985348838375],
1129: [1242, 2199023262543, 6597069771886, 6597069776731]}
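If you prefer an explicit loop closer to your original attempt, here is a sketch using collections.defaultdict. Note that it iterates over the column values, not over list(df1.iloc[:, 0:1]), which only yields the column label (that is why your loop only saw '933' and '4139'):
from collections import defaultdict
adj_list = defaultdict(list)
# pair up the values of the first two columns row by row
for src, tgt in zip(df1.iloc[:, 0], df1.iloc[:, 1]):
    adj_list[src].append(tgt)
print(dict(adj_list))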
I want to translate this Scala code into PySpark.
Scala Code:
Row = {
  val columnArray = new Array[String](95)
  columnArray(0) = x.substring(0, 10)
  columnArray(1) = x.substring(11, 14)
  columnArray(2) = x.substring(15, 17)
  Row.fromSeq(columnArray)
}
How can I express the same Scala logic in PySpark?
@Felipe Avalos @Nicolas Grenié
Assuming you are trying to convert an array of strings to a DataFrame with the substrings as the corresponding columns, this will do the trick in PySpark.
Change column_array to hold your array of strings and column_names to hold the name of each column:
column_array = ["abcdefghijklmnopqrst", "abcdefghijklmnopqrst"]
column_names = ["col1", "col2", "col3", "col4"]
This maps the array to an RDD with the string and its substrings as the values; the RDD is then converted to a DataFrame with the given column names.
sc.parallelize(column_array).map(lambda x: (x, x[0:10], x[11:14], x[15:17])).toDF(column_names).show()
This will generate the following data frame:
+--------------------+----------+----+----+
| col1| col2|col3|col4|
+--------------------+----------+----+----+
|abcdefghijklmnopqrst|abcdefghij| lmn| pq|
|abcdefghijklmnopqrst|abcdefghij| lmn| pq|
+--------------------+----------+----+----+
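If you want a translation closer to the Scala original, which builds a Row from substrings of each input string, here is a sketch using pyspark.sql.Row (the sample line below is a hypothetical placeholder; replace the RDD with your real source):
from pyspark.sql import Row
lines = sc.parallelize(["abcdefghijklmnopqrst"])  # placeholder input
# Row(...) mirrors Row.fromSeq(columnArray) in the Scala snippet
rows = lines.map(lambda x: Row(x[0:10], x[11:14], x[15:17]))
rows.toDF(["col1", "col2", "col3"]).show()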