Issue in creating a DataFrame by passing lists - Python

After doing a few data manipulations I ended up with two lists, avglist and sumlist, and I passed these two lists to my result_df:
result_df = pd.DataFrame({"File Name": filelist ,"Average":avglist,"Sum":sumlist})
print(result_df)
Below is my output, but the problem is:
1) even the header "Continental AG" and the dtype info are included in each cell; I just need the values, e.g. 874 and 584, in the Sum column.
I tried avglist.value(), but .value is not a list method, and I also tried a few variations with .index, but did not get the expected result.
Am I missing any steps here?

There is something off with how you're importing your files. If you take a .sum() of your DataFrame, it will give you back the sum of each column as a Series. I suspect you may be doing that, since you are summing a DataFrame; then, when you try to put that list into another DataFrame, it looks funky.
Let's take the following two DataFrames:
df = pd.DataFrame({'a': [1, 20, 30, 4, 0],
                   'b': [1, 0, 3, 4, 0],
                   'c': [1, 3, 7, 7, 5],
                   'd': [1, 8, 3, 8, 5],
                   'e': [1, 11, 3, 4, 0]})
df2 = pd.DataFrame({'a': [1, 20, 100, 4, 0],
                    'b': [1, 0, 39, 49, 10],
                    'c': [1, 3, 97, 7, 95],
                    'd': [441, 38, 23, 8, 115],
                    'e': [1, 11, 13, 114, 0]})
Looking at the sum of one of these DataFrames:
df.sum()
a 55
b 8
c 23
d 25
e 19
dtype: int64
Now if we were to take the sums of these DataFrames and put them in a list:
sums = [x.sum() for x in [df, df2]]
When we inspect this, we get:
[a 55
b 8
c 23
d 25
e 19
dtype: int64, a 125
b 99
c 203
d 625
e 139
dtype: int64]
If you want the sum of the whole DataFrame and not just per column, you can use .sum().sum(), which will first sum each column and then sum those column totals:
df.sum().sum()
130
So across DataFrames it would be:
sums = [x.sum().sum() for x in [df, df2]]
Computing the mean depends on how your CSVs look. If you were to do .mean().mean(), that might be very different from what you're looking for. If it's just one column every time, it would be fine, but if there were more, it would take the mean of each of, say, 5 columns and then the mean of those (the 5 averages summed and divided by 5).
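As a quick illustration of that distinction, here is a minimal sketch using the df defined above; with equal-length, NaN-free columns the two agree, but once columns have different numbers of valid values they diverge:
df.mean().mean()    # mean of the per-column means  -> 5.2 for this df
df.values.mean()    # mean over every value at once -> also 5.2 for this df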
Lastly, it looks like "Continental AG (Worldwide)" is the name of your column.
So in your for loop you should be doing:
sums = [df['Continental AG (Worldwide)'].sum() for df in list_dfs]
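Putting it together, a minimal sketch of the whole pipeline might look like the following (assuming the files are CSVs and the column of interest really is named 'Continental AG (Worldwide)'; the file names here are just placeholders):
import pandas as pd

filelist = ['file1.csv', 'file2.csv']   # hypothetical file names
avglist, sumlist = [], []
for path in filelist:
    data = pd.read_csv(path)
    col = data['Continental AG (Worldwide)']   # the single column of interest
    sumlist.append(col.sum())    # scalar total, not a Series
    avglist.append(col.mean())   # scalar average
result_df = pd.DataFrame({"File Name": filelist, "Average": avglist, "Sum": sumlist})
print(result_df)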

I performed a few operations, something like below:
while i < len(filepath):
    .....
    df['Date'] = df['Time'].apply(lambda i: i.split('T')[0])
    .......
    .......
    sum1 = sum_df.sum(axis=0)
    avg1 = Avg_df.sum(axis=0)
    .......
    .......
    avglist.append(avg1)
    sumlist.append(sum1)
    .....
    i += 1
So I changed all my operations to the below:
df['Date']=df.iloc[:,0].apply(lambda i:i.split('T')[0])
.........
.........
sum1=sum_df.iloc[:,0].sum()
avg1=Avg_df.iloc[:,0].mean()
.....
.....
avglist.append(avg1)
sumlist.append(sum1)
Instead of using the column name and axis in my operations, I switched everything to DataFrame.iloc, and it started giving me correct results.
I'm still not sure about the precise reason, but this change worked for me.
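For reference, the likely reason: DataFrame.sum(axis=0) returns a Series (one total per column, carrying the column name and dtype with it), whereas .iloc[:, 0].sum() returns a plain scalar, so only the latter drops the header/dtype text when appended to a list. A small sketch of the difference (the column name and values are just illustrative):
import pandas as pd

tmp = pd.DataFrame({'Continental AG (Worldwide)': [874, 584]})
print(tmp.sum(axis=0))        # a Series: prints the column name plus 'dtype: int64'
print(tmp.iloc[:, 0].sum())   # a scalar: 1458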

Related

Pandas - Add a new column extracting value from arrays based on other column value

I am currently stuck trying to extract a value from a list/array depending on the values of a DataFrame.
Imagine I have this array. I can create this array manually, so I can arrange the numbers any way I want; I just thought a Python list was the best fit here, but anything goes:
value = [[30, 120, 600, 3000], [15, 60, 300, 1500], [30, 120, 600, 3000], [10, 40, 200, 1000],[10, 40, 200, 1000], [10, 40, 200, 1000], [10, 40, 200, 1000], [5, 20, 100, 500]]
I also have a DataFrame that comes from much bigger/dynamic processing, with two columns of int type. Here is some code to recreate those two columns as an example.
The possible values of id1 go from 0 to 6 and those of id2 from 0 to 3.
data = {'id1': [4, 2, 6, 6], 'id2': [1, 2, 3, 1]}
df = pd.DataFrame(data)
What I want to do is add an additional column to the DataFrame df, taken from the array based on the values of the two columns.
So, for example, the first row of the DataFrame should take the value value[4][1] = 40, ending up with a DataFrame like this:
result = {'id1': [4, 2, 6, 6], 'id2': [1, 2, 3, 1], 'matched value': [40, 600, 1000, 40]}
dfresult = pd.DataFrame(result)
I am a bit lost on what the best way to achieve this is.
What comes to mind is a very brute-force solution: take the values of the multidimensional array and create a single list of all the possible 7*4 combinations, add a new column to the DataFrame that is the concatenation of the two ids, and then do a straight join on that condition. This would likely work here because the possible combinations are few, but I am certain there is a learning opportunity in using lists in a dynamic way that escapes me!
You can use a list comprehension to iterate over the id pairs and retrieve the corresponding value for each pair:
df['matched_val'] = [value[i][j] for i, j in zip(df['id1'], df['id2'])]
Or, a better solution with NumPy indexing, applicable only if the sub-lists inside value are of equal length:
df['matched_val'] = np.array(value)[df['id1'], df['id2']]
Result
id1 id2 matched_val
0 4 1 40
1 2 2 600
2 6 3 1000
3 6 1 40
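For completeness, a self-contained version of both approaches (re-using the value list and df from the question):
import numpy as np
import pandas as pd

value = [[30, 120, 600, 3000], [15, 60, 300, 1500], [30, 120, 600, 3000],
         [10, 40, 200, 1000], [10, 40, 200, 1000], [10, 40, 200, 1000],
         [10, 40, 200, 1000], [5, 20, 100, 500]]
df = pd.DataFrame({'id1': [4, 2, 6, 6], 'id2': [1, 2, 3, 1]})

# pure-Python lookup; works even if the sub-lists had different lengths
df['matched_val'] = [value[i][j] for i, j in zip(df['id1'], df['id2'])]

# vectorised NumPy fancy indexing; requires equal-length sub-lists
df['matched_val'] = np.array(value)[df['id1'], df['id2']]
print(df)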

How to concatenate columns and pivot keeping columns information in pandas

I have an input df:
input_ = pd.DataFrame.from_records(
[
['X_val', 'Y_val1', 'Y_val2', 'Y_val3'],
[1, 10, 11, 31],
[2, 20, 12, 21],
[3, 30, 13, 11],])
and want to concatenate every Y value while still keeping track of which column the value came from, for plotting and analysis.
I have multiple files with a variable number of Y columns and ended up concatenating them column by column and extending with a repeated label value, but I was wondering if there is a better solution, because mine is terribly tedious.
expected_output_ = pd.DataFrame.from_records(
[
['X_val', 'Y_val', 'Y_type'],
[1, 10, 'Y_val1'],
[1, 11, 'Y_val2'],
[1, 31, 'Y_val3'],
[2, 20, 'Y_val1'],
[2, 12, 'Y_val2'],
[2, 21, 'Y_val3'],
[3, 30, 'Y_val1'],
[3, 13, 'Y_val2'],
[3, 11, 'Y_val3'],])
You can use pandas.DataFrame.melt:
input_.melt(
id_vars=['X_val'],
value_vars=['Y_val1', 'Y_val2', 'Y_val3'],
var_name='Y_type',
value_name='Y_val'
).sort_values(['X_val'], ignore_index=True)
Alternatively, as suggested by @Vishnudev, you can also use the following variation, especially for a large number of similarly named Y_val* columns:
input_.melt(
id_vars=['X_val'],
value_vars=input_.filter(regex='Y_val').columns,
var_name='Y_type',
value_name='Y_val'
).sort_values(['X_val'], ignore_index=True)
Output:
X_val Y_type Y_val
0 1 Y_val1 10
1 1 Y_val2 11
2 1 Y_val3 31
3 2 Y_val1 20
4 2 Y_val2 12
5 2 Y_val3 21
6 3 Y_val1 30
7 3 Y_val2 13
8 3 Y_val3 11
Optionally, you can rearrange the column sequence if you like.
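For instance, assuming the melted result is stored in a variable called out, one way to match the column order of expected_output_ would be:
out = out[['X_val', 'Y_val', 'Y_type']]   # reorder to X, value, type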

Concatenate in place in a sub-function with the pandas concat function?

I'm trying to write a function that takes a pandas DataFrame as an argument and at some point concatenates this DataFrame with another.
For example:
def concat(df):
    df = pd.concat((df, pd.DataFrame({'E': [1, 1, 1]})), axis=1)
I would like this function to modify the input df in place, but I can't find how to achieve this. When I do
...
print(df)
concat(df)
print(df)
the DataFrame df is identical before and after the function call.
Note: I don't want to do df['E'] = [1, 1, 1] because I don't know how many columns will be added to df. So I want to use pd.concat(), if possible...
This will edit the original DataFrame in place and give the desired output, as long as the new data contains the same number of rows as the original and there are no conflicting column names.
It's the same idea as your df['E'] = [1, 1, 1] suggestion, except it works for an arbitrary number of columns.
I don't think there is a way to achieve this using pd.concat, as it doesn't have an inplace parameter the way some pandas functions do.
df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
df2 = pd.DataFrame({'C': [10, 20, 30], 'D': [40, 50, 60]})
df[df2.columns] = df2
Results (df):
A B C D
0 1 4 10 40
1 2 5 20 50
2 3 6 30 60
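If you want the function-style interface from the question, the same idea can be wrapped in a helper that mutates its argument; a minimal sketch (the function and argument names are just for illustration):
import pandas as pd

def concat_inplace(df, other):
    # column assignment mutates the caller's DataFrame,
    # unlike pd.concat, which always returns a new object
    df[other.columns] = other

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
concat_inplace(df, pd.DataFrame({'E': [1, 1, 1], 'F': [2, 2, 2]}))
print(df)   # df now has columns A, B, E, F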

How to sum a slice from a pandas dataframe

I'm trying to sum a portion of the sessions in my dictionary so I can get totals for the current and previous week.
I've converted the JSON into a pandas dataframe in one test. I'm summing the total of the sessions using the .sum() function in pandas. However, I also need to know the total sessions from this week and the week prior. I've tried a few methods to sum values (-1:-7) and (-8:-15), but I'm pretty sure I need to use .iloc.
IN:
response = requests.get("url")
data = response.json()
df=pd.DataFrame(data['DailyUsage'])
total_sessions = df['Sessions'].sum()
current_week= df['Sessions'].iloc[-1:-7]
print(current_week)
total_sessions =['current_week'].sum
OUT:
Series([], Name: Sessions, dtype: int64)
AttributeError 'list' object has no attribute 'sum'
Note: I've tried this with and without pd.to_numeric and also with variations on the syntax of the slice and sum methods. Pandas doesn't feel very Pythonic and I'm out of ideas as to what to try next.
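As a side note, the empty Series in the output comes from the slice itself: iloc[-1:-7] walks forward from the last row with a step of +1, so it selects nothing. Something like the following would grab the last seven rows instead:
df['Sessions'].iloc[-7:]           # last 7 rows, oldest to newest
df['Sessions'].iloc[-1:-8:-1]      # last 7 rows, newest to oldest
df['Sessions'].iloc[-7:].sum()     # and .sum() with parentheses gives the total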
Assuming that df['Sessions'] holds each day, and you are comparing current and previous week only, you can use reshape to create a weekly sum for the last 14 values.
weekly_matrix = df['Sessions'][:-15:-1].values.reshape((2, 7))
Then, you can sum each row and get the weekly sum, most recent will be the first element.
import numpy as np
weekly_sum = np.sum(weekly_matrix, axis=1)
current_week = weekly_sum[0]
previous_week = weekly_sum[1]
EDIT: how the code works
Let's take the 1D-array which is accessed via the values attribute of the pandas Series. It contains the last 14 days, ordered from most recent to oldest. I will call it x.
x = array([14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1])
The array's reshape function is then called on x to split this data into a 2D-array (matrix) with 2 rows and 7 columns.
The default behavior of the reshape function is to first fill all columns in a row before moving to the next row. Therefore, x[0] will be the element (1,1) in the reshaped array, x[1] will be the element (1,2), and so on. After the element (1,7) is filled with x[6] (ending the current week), the next element x[7] will then be placed in (2,1). This continues until finishing the reshape operation, with the placement of x[13] in (2,7).
This results in placing the first 7 elements of x (current week) in the first row, and the last 7 elements of x (previous week) in the second row. This was called weekly_matrix.
weekly_matrix = x.reshape((2, 7))
# weekly_matrix = array([[14, 13, 12, 11, 10, 9, 8],
# [ 7, 6, 5, 4, 3, 2, 1]])
Since we now have the values of each week organized in a matrix, we can use the numpy.sum function to finish our operation. numpy.sum can take an axis argument, which controls how the value is computed:
- if axis=None, all elements are added into a grand total.
- if axis=0, all rows in each column are added. In the case of weekly_matrix, this results in a 7-element 1D-array ([21, 19, 17, 15, 13, 11, 9]), which is not the result we want, as we would be adding the equivalent days of each week together.
- if axis=1 (as in the solution), all columns in each row are added, producing a 2-element 1D-array in the case of weekly_matrix. The order of this result array follows the order of the rows in the matrix (i.e., element 0 is the total of the first row, and element 1 is the total of the second row). Since we know that the first row is the current week and the second row is the previous week, we can extract the information using these indexes:
# weekly_sum = array([77, 28])
current_week = weekly_sum[0] # sum of [14, 13, 12, 11, 10, 9, 8] = 77
previous_week = weekly_sum[1] # sum of [ 7, 6, 5, 4, 3, 2, 1] = 28
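Putting the pieces together, the whole approach fits in a few lines (a sketch assuming df['Sessions'] has at least 14 daily rows, most recent last):
import numpy as np

last_14 = df['Sessions'][:-15:-1].values    # last 14 days, newest first
weekly_matrix = last_14.reshape((2, 7))     # row 0 = current week, row 1 = previous week
weekly_sum = weekly_matrix.sum(axis=1)
current_week, previous_week = weekly_sum[0], weekly_sum[1]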
To group and sum by a fixed number of values, for instance with daily data and weekly aggregation, consider groupby. You can do this forwards or backwards by slicing your series as appropriate:
np.random.seed(0)
df = pd.DataFrame({'col': np.random.randint(0, 10, 21)})
print(df['col'].values)
# array([5, 0, 3, 3, 7, 9, 3, 5, 2, 4, 7, 6, 8, 8, 1, 6, 7, 7, 8, 1, 5])
# forwards groupby
res = df['col'].groupby(df.index // 7).sum()
# 0 30
# 1 40
# 2 35
# Name: col, dtype: int32
# backwards groupby
df['col'].iloc[::-1].reset_index(drop=True).groupby(df.index // 7).sum()
# 0 35
# 1 40
# 2 30
# Name: col, dtype: int32

Idiomatic way of only selecting certain rows from a dataframe (whose index exists in other dataframes)

So I have two pandas time series, and the indexes on both are timestamps. The thing is, not all of the timestamps exist in both series. I want to perform a linear regression on the points that are matched up, ignoring those which have no 'pair'.
This is my current solution, but it seems somewhat verbose and ugly:
indexes_used = sorted(set(series1.index).intersection(series2.index))
perform_regression(series1.loc[indexes_used], series2.loc[indexes_used])
Alternatively, I was thinking of doing (but creating a temporary dataframe seems redundant):
temp_frame = pd.concat([series1, series2], axis=1).dropna()  # axis=1 to keep timestamps on the vertical axis
perform_regression(blabla)
Is there a good way to do this?
How about Series.align:
import pandas as pd
a = pd.Series([4, 5, 6, 7], index=[1, 2, 3, 4])
b = pd.Series([49, 54, 62, 74], index=[2, 6, 4, 0])
a2, b2 = a.align(b, join="inner")
the output:
2 5
4 7
dtype: int64
2 49
4 62
dtype: int64
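From there, the regression can run directly on the aligned values; a minimal sketch of the perform_regression step, using scipy.stats.linregress as one possible implementation:
from scipy import stats

a2, b2 = series1.align(series2, join="inner")     # keep only timestamps present in both
result = stats.linregress(a2.values, b2.values)   # simple linear regression
print(result.slope, result.intercept, result.rvalue)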
