Python: for cycle in range over files

Python: for cycle in range over files - python

I am trying to create a list that takes values from different files.
I have three dataframes called for example "df1","df2","df3"
each files contains two columns with data, so for example "df1" looks like this:
0, 1
1, 4
7, 7
I want to create a list that takes a value from first row in second column in each file, so it should look like this
F=[1,value from df2,value from df3]
my try
import pandas as pd
df1 = pd.read_csv(file1)
df2 = pd.read_csv(file2)
df3 = pd.read_csv(file3)
F=[]
for i in range(3):
F.append(df{"i"}[1][0])
probably that is not how to iterate over, but I cannot figure out the correct way

You can use iloc and list comprehension
vals = [df.iloc[0, 1] for df in [df1,df2,df3]]
iloc will get value from first row (index 0) and second column (index 1). If you wanted, say, value from third row and fourth column, you'd do .iloc[2, 3] and so forth.
As suggested by #jpp, you may use iat instead:
vals = [df.iat[0, 1] for df in [df1,df2,df3]]
For difference between them, check this and this question

Related

Take rows of ith duplicated index and put in i number of dataframes

This is a bit tricky to put into words, but I'll give it a try. I have a dataframe with duplicated indices as provided below.
a = [0.00000, 0.071928, 1.294, 2.592563, 0.000318, 2.575291, 0.439986, 2.232147, 6.091523, 2.075441, 0.96152]
b = [0.00000, 0.399791, 1.302446, 1.388957, 1.276451, 1.527568, 1.614107, 2.686325, 4.167600, 6.135689, 5.945807]
df = pd.DataFrame({'a' : a, 'b' : b})
df.index = [1,1,1,1,1,2,2,3,3,3,4]
I want the row of the first duplicated index for every number to be appended to df1, and the row of the second duplicated index to be appended to df2, etc; the first time indices 1, 2, 3, 4... n have a duplicate, those rows get appended to dataframe 1. The second time indices 1, 2, 3, 4...n have a duplicate, those rows get appended to dataframe 2, and so on. Ideally, it would look something like this if concatenated for the first three duplicates under the 'index' column:
Any idea how to go about this? I've tried to run df[df.duplicated(subset = ['index'])] in a for loop to widdle down the df to the very first duplicates, but it doesn't seem to work the way I think it will.

Slicing out the duplicate indices via cumcount and using concat to stitch together the resulting sub-dataframes will do the job.
cols = df.columns
df['id'] = df.index
pd.concat([df[df.groupby('id').cumcount()==i][cols] for i in range(0, max(df.groupby('id').cumcount().values))], axis=1)

Converting every other csv file column from python list to value

I have several large csv filess each 100 columns and 800k rows. Starting from the first column, every other column has cells that are like python list, for example: in cell A2, I have [1000], in cell A3: I have [2300], and so forth. Column 2 is fine and are numbers, but columns 1, 3, 5, 7, etc, ...99 are similar to the column 1, their values are inside list. Is there an efficient way to remove the sign of the list [] from those columns and make their cells like normal numbers?
files_directory: r":D\my_files"
dir_files =os.listdir(r"D:\my_files")
for file in dir_files:
edited_csv = pd.read_csv("%s\%s"%(files_directory, file))
for column in list(edited_csv.columns):
if (column % 2) != 0:
edited_csv[column] = ?

Please try:
import pandas as pd
df = pd.read_csv('file.csv', header=None)
df.columns = df.iloc[0]
df = df[1:]
for x in df.columns[::2]:
df[x] = df[x].apply(lambda x: float(x[1:-1]))
print(df)

When reading the cells, for example column_1[3], which in this case is [4554.8433], python will read them as arrays. To read the numerical value inside the array, simply read the values like so:
value = column_1[3]
print(value[0]) #prints 4554.8433 instead of [4554.8433]

Python: Extract unique index values and use them in a loop

I would like to apply the loop below where for each index value the unique values of a column called SERIAL_NUMBER will be returned. Essentially I want to confirm that for each index there is a unique serial number.
index_values = df.index.levels
for i in index_values:
x = df.loc[[i]]
x["SERIAL_NUMBER"].unique()
The problem, however, is that my dataset has a multi-index and as you can see below it is stored in a frozen list. I am just interested in the index values that contain a long number. The word "vehicle" also as an index can be removed as it is repeated all over the dataset.
How can I extract these values into a list so I can use them in the loop?
index_values
>>
FrozenList([['0557bf98-c3e0-4955-a23f-2394635ab531', '074705a3-a96a-418c-9bfe-14c37f5c4e6f', '0f47e260-0fa2-40ba-a417-7c00ea74248c', '17342ca2-6246-4150-8080-96d6125cf2b5', '26c6c0d1-0134-4b3a-a149-61dd93afab3b', '7600be43-5d0a-49b3-a1ee-fd107db5822f', 'a07f2b0c-447c-4143-a361-d7ddbffdcc77', 'b929801c-2f32-4a95-bfc4-48a05b48ee01', 'cc912023-0113-42cd-8fe7-4df4005127c2', 'e424bd02-e188-462e-a1a6-2f4ed8fe0a2d'], ['vehicle']])

without an example its hard to judge, but I think you need
df.index.get_level_values(0).unique() # add .tolist() if you want a list
import pandas as pd
df = pd.DataFrame({'A' : [5]*5, 'B' : [6]*5})
df = df.set_index('A',append=True)
df.index.get_level_values(0).unique()
Int64Index([0, 1, 2, 3, 4], dtype='int64')
df.index.get_level_values(1).unique()
Int64Index([5], dtype='int64', name='A')
to drop duplicates from an index level use the .duplicated() method.
df[~df.index.get_level_values(1).duplicated(keep='first')]
B
A
0 5 6

pandas dataframe: len(df) is not equal to number of iterations in df.iterrows()

I have a dataframe where I want to print each row to a different file. When the dataframe consists of e.g. only 50 rows, len(df) will print 50 and iterating over the rows of the dataframe like
for index, row in df.iterrows():
print(index)
will print the index from 0 to 49.
However, if my dataframe contains more than 50'000 rows, len(df)and the number of iterations when iterating over df.iterrows() differ significantly. For example, len(df) will say e.g. 50'554 and printing the index will go up to over 400'000.
How can this be? What am I missing here?

First, as #EdChum noted in the comment, your question's title refers to iterrows, but the example you give refers to iteritems, which loops in the orthogonal direction to that relevant to len. I assume you meant iterrows (as in the title).
Note that a DataFrame's index need not be a running index, irrespective of the size of the DataFrame. For example:
df = pd.DataFrame({'a': [1, 2, 3, 4]}, index=[2, 4, 5, 1000])
>>> for index, row in df.iterrows():
... print index
2
4
5
1000
Presumably, your long DataFrame was just created differently, then, or underwent some manipulation, affecting the index.
If you really must iterate with a running index, you can use Python's enumerate:
>>> for index, row in enumerate(df.iterrows()):
... print index
0
1
2
3
(Note that, in this case, row is itself a tuple.)

Pandas recalculate index after a concatenation

I have a problem where I produce a pandas dataframe by concatenating along the row axis (stacking vertically).
Each of the constituent dataframes has an autogenerated index (ascending numbers).
After concatenation, my index is screwed up: it counts up to n (where n is the shape[0] of the corresponding dataframe), and restarts at zero at the next dataframe.
I am trying to "re-calculate the index, given the current order", or "re-index" (or so I thought). Turns out that isn't exactly what DataFrame.reindex seems to be doing.
Here is what I tried to do:
train_df = pd.concat(train_class_df_list)
train_df = train_df.reindex(index=[i for i in range(train_df.shape[0])])
It failed with "cannot reindex from a duplicate axis." I don't want to change the order of my data... just need to delete the old index and set up a new one, with the order of rows preserved.

If your index is autogenerated and you don't want to keep it, you can use the ignore_index option.
`
train_df = pd.concat(train_class_df_list, ignore_index=True)
This will autogenerate a new index for you, and my guess is that this is exactly what you are after.

After vertical concatenation, if you get an index of [0, n) followed by [0, m), all you need to do is call reset_index:
train_df.reset_index(drop=True)
(you can do this in place using inplace=True).
import pandas as pd
>>> pd.concat([
pd.DataFrame({'a': [1, 2]}),
pd.DataFrame({'a': [1, 2]})]).reset_index(drop=True)
a
0 1
1 2
2 1
3 2

This should work:
train_df.reset_index(inplace=True, drop=True)
Set drop to True to avoid an additional column in your dataframe.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Python: for cycle in range over files - python

Related

Take rows of ith duplicated index and put in i number of dataframes

Converting every other csv file column from python list to value

Python: Extract unique index values and use them in a loop

pandas dataframe: len(df) is not equal to number of iterations in df.iterrows()

Pandas recalculate index after a concatenation

Categories

Resources